Distributed Load Testing at Scale
Introduction
Load testing is an important part of non-functional testing. It identifies system bottlenecks across a variety of factors such as CPU, memory, network bandwidth, read/write latencies, response times and a plethora of other metrics. It not only helps your system handle a large number of concurrent users, but also tells you how much load your current infrastructure can support and which autoscaling policies you need for intermittent spikes. As you add more and more virtual users/threads, you need more load generators, because the load generators themselves must not be starved of resources.
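To make the scaling pressure concrete, here is a minimal sketch of how the number of load generators could be derived from a target virtual-user count and then split across them. The per-generator capacity is an assumption you would measure for your own scripts and hardware, and the function names are hypothetical, not part of our tool.

```python
def executors_needed(total_vusers: int, vusers_per_executor: int) -> int:
    """Smallest number of load generators such that none runs more
    virtual users than its (assumed, measured) capacity."""
    if total_vusers <= 0 or vusers_per_executor <= 0:
        raise ValueError("both arguments must be positive")
    # Integer ceiling division: a partially filled generator is still a generator.
    return -(-total_vusers // vusers_per_executor)


def split_vusers(total_vusers: int, executors: int) -> list[int]:
    """Spread virtual users as evenly as possible across executors."""
    base, extra = divmod(total_vusers, executors)
    # The first `extra` executors take one additional virtual user each.
    return [base + 1 if i < extra else base for i in range(executors)]


# Hypothetical capacity: 2,000 virtual users per generator container.
n = executors_needed(1_000_000, 2_000)
print(n, split_vusers(1_000_000, n)[:3])
```

Under that assumed capacity, a million virtual users already means 500 generator containers, which is why a single JMeter or Gatling box stops being an option.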
Problem Statement
If you quickly want to load test your backend systems, you would probably spin up JMeter or bring up a Gatling container and start testing with your scripts. This is fine for small or even fairly large systems, but the problem arises when you want to load test with a million users and need to scale your load generators horizontally. Here is what we at … are trying to solve:
- Provide infrastructure and tooling to carry out load tests without worrying about the scale of virtual users.
- Real-time visualisation of triggered tests.
- Running any number of load tests concurrently.
- Summary & detailed reporting at the end of tests.
- CI-CD gating with performance tests.
- An inbuilt profiler to profile your pods on k8s.
- Cloud-native, VM and BYOVM (bring your own VM) options for load generation.
- A load-tool-agnostic solution: teams at … use different tools & scripts.
We at … have built a tool to solve the above challenges and are continuously improving it to keep up with changing tech trends.
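Summary reporting across many executors has one subtle step: percentiles cannot be averaged across generators. The sketch below is a minimal illustration under stated assumptions. It assumes each executor's raw latency samples (in milliseconds) have already been fetched as plain lists; querying the time-series store is omitted, and the report keys are hypothetical.

```python
import math
from statistics import mean


def percentile(sorted_values: list[float], p: int) -> float:
    """Nearest-rank percentile (1 <= p <= 100) of an ascending list."""
    # Integer p keeps p * n exact, so the rank is not skewed by float rounding.
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(samples_by_executor: dict[str, list[float]]) -> dict[str, float]:
    """Build an end-of-test summary from every executor's raw latencies (ms)."""
    # Merge the raw samples first: averaging each executor's own p95
    # does not give the p95 of the whole test.
    merged = sorted(x for samples in samples_by_executor.values() for x in samples)
    if not merged:
        raise ValueError("no samples reported by any executor")
    return {
        "count": len(merged),
        "mean_ms": mean(merged),
        "p50_ms": percentile(merged, 50),
        "p95_ms": percentile(merged, 95),
        "p99_ms": percentile(merged, 99),
        "max_ms": merged[-1],
    }


print(summarize({"exec-1": [10, 20, 30, 40, 50], "exec-2": [60, 70, 80, 90, 100]}))
```

Merging raw samples is simple but memory-heavy for very long runs; a mergeable histogram (such as HdrHistogram) is the usual alternative when sample counts get large.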
Architecture Diagram

At …, our load tool consists of the following components:
- User Interface: teams trigger their tests via a UI that asks for the git URL of their load scripts and a few other configurations.
- Scripts & Config Files: users write a global, tool-agnostic JSON file in which they provide all load-related configuration, such as test duration, API endpoints, ramp-ups and virtual users. The file must follow a strict schema.
- Validator Component/Service: validates user scripts and configs against the schema and checks the data for errors. It is the entry point to our backend system and is triggered via the UI.
- Worker: our main component, responsible for triggering load test runs, determining the number of executors, and creating executors on different load-generation clusters. We instantiate one worker per triggered test, and it lives until the test completes.
- Executors: containers with load tools such as JMeter/Gatling (or any other load tool) baked in. They run the load tests and send data points to a time-series database for aggregation at the end of the test.
- Reporting Service: aggregates data points across executors and builds the final report, which is sent to the user via the configured Slack and email channels. Executors push data points to KairosDB, and final reports are stored in a NoSQL database.
- Visualisation: we also provide live reporting for each triggered test through an embedded Grafana in our UI, where users can follow the progress of their tests.
- Metadata Service: keeps a record of test runs for the different teams/projects onboarded to our load tool.
- CI-CD Integration: we provide a custom plugin that integrates the tool with our in-house CI-CD platform, where it can be used to gate deployments based on user-defined parameters.
- Pod Profiler: provides pod suggestions to application teams for daily and holiday peaks.
- Kafka Integration Jars: we provide custom jars for load testing Kafka producers & samplers, which can be included in the configs for a Kafka load test.
- Helper Service (Job Service): supports creating executors and workers at runtime across different k8s clusters.
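The config file and validator described above can be illustrated with a small sketch. The field names below are hypothetical stand-ins for the duration, ramp-up, virtual-user and endpoint settings, not our real schema, and the checks use only the standard library. It shows the two layers a validator typically applies: a strict structural check (required fields, types, no unknown keys) followed by data checks a type-only schema cannot express.

```python
import json

# Hypothetical schema: field name -> required type. Not our real schema.
REQUIRED_FIELDS = {
    "tool": str,
    "duration_seconds": int,
    "ramp_up_seconds": int,
    "virtual_users": int,
    "endpoints": list,
}
SUPPORTED_TOOLS = {"jmeter", "gatling"}


def validate_config(raw: str) -> list[str]:
    """Return a list of validation errors; an empty list means the config is valid."""
    try:
        config = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(config, dict):
        return ["top-level value must be a JSON object"]

    # Structural checks: every field present, correctly typed, nothing extra.
    errors = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in config:
            errors.append(f"missing field: {field}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif not isinstance(config[field], expected) or isinstance(config[field], bool):
            errors.append(f"{field} must be of type {expected.__name__}")
    errors.extend(f"unknown field: {f}" for f in sorted(set(config) - set(REQUIRED_FIELDS)))
    if errors:
        return errors

    # Data checks that a type-only schema cannot express.
    if config["tool"] not in SUPPORTED_TOOLS:
        errors.append(f"unsupported tool: {config['tool']}")
    if config["virtual_users"] <= 0:
        errors.append("virtual_users must be positive")
    if config["duration_seconds"] <= 0:
        errors.append("duration_seconds must be positive")
    if not 0 <= config["ramp_up_seconds"] <= config["duration_seconds"]:
        errors.append("ramp_up_seconds must be between 0 and duration_seconds")
    if not config["endpoints"]:
        errors.append("endpoints must not be empty")
    return errors
```

Returning every error at once, rather than raising on the first one, lets the UI show the user all problems with their config in a single round trip.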
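Deployment gating based on user-defined parameters boils down to comparing a test's summary metrics against thresholds. The following is a minimal sketch, not our plugin's actual interface: the `Threshold` type, the metric keys and the fail-closed rule for missing metrics are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """A user-defined gate: the metric must not exceed max_value."""
    metric: str  # hypothetical metric key, e.g. "p95_ms" or "error_rate"
    max_value: float


def evaluate_gate(summary: dict[str, float],
                  thresholds: list[Threshold]) -> tuple[bool, list[str]]:
    """Return (passed, reasons); a missing metric fails the gate (fail closed)."""
    failures = []
    for t in thresholds:
        observed = summary.get(t.metric)
        if observed is None:
            failures.append(f"{t.metric}: missing from report")
        elif observed > t.max_value:
            failures.append(f"{t.metric}: {observed} > {t.max_value}")
    return not failures, failures


passed, reasons = evaluate_gate(
    {"p95_ms": 180.0, "error_rate": 0.02},
    [Threshold("p95_ms", 200.0), Threshold("error_rate", 0.01)],
)
# A CI step would typically exit non-zero on failure to block the deployment.
print("PASS" if passed else "FAIL: " + "; ".join(reasons))
```

Failing closed on a missing metric is a deliberate choice: a broken report should block a deployment rather than silently let it through.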
Tool Usage & Statistics
At …, application teams constantly load test their backend systems, and the tool is used by 1000+ teams.
~ runs were executed on our load tool during the holiday period, which spanned over a month.
~ new projects were onboarded.
Road Ahead
- Enabling new load tools such as Locust; as of now we support JMeter & Gatling.
- Making reporting & analysis more intelligent using machine learning & data science.
- Adding JVM profiling to the tool to provide a one-stop solution for all.
- Continuously improving the tool to give users a low-latency, high-throughput, scalable load-testing platform.