Any limitations with distributed load-drivers?

Hi (again),

We are trying to use rally to put max load on our cluster and I just want to rule out rally as a potential bottleneck for our current issue, we are unable to go above 105.000:ish docs per second on our infrastructure.

Problem: Our throughput does not scale at all above approx 100.000 docs/sec

Question: Do you see any limitations in rally here so rally does not generate enough load? esrally nodes does not seem to be under much pressure (fairly low cpu and load) so my guess is that it is not rally.

We are running rally-eventdata-track append-no-conflicts. Basically I need to rule out all uncertain things when I talk to the infrastructure team and complain about the infrastructure.

We are running vmware, SAN, 64gb mem (30 HEAP), 8 cores. Elasticsearch 5.6.2, esrally 0.7.4.


We have 0 replicas for all tests.

  • One esrally node, 2 data-nodes and 6 shards we get 85.000 docs/sec (this is good!)

  • Three esrally nodes + 1 load-balancer node + 3 data nodes -> approx 105.000 docs/sec (this is ok!)

  • Three esrally nodes + 2 load-balancer node + 6 data nodes -> approx 105.000 docs/sec (this is catastrophic!)

Any help is appreciated.

Have you tried sending traffic directly to the data nodes in order to rule out that the coordinating-only nodes are the bottleneck? Do you have monitoring in place? What does disk I/O and iowait look like on the nodes as you scale out (your SAN is serving all nodes, so could be a bottleneck)? Are you able to test with local disk instead of the SAN to see if that makes any difference?

Thanks a lot for you answer!

Yes, I have done most of that, just let me get back with details, will have to dig into our metrics.

And YES, SAN and perhaps also vmware could definately be the bottleneck but I have to rule out all other possibilities.

I have tried to run this without the LB nodes but I will have to try that again...

We are unfortunately NOT able to test with local disk, at least not within reasonable time.

Some metrics:

Two nodes

This displays the throughput of running 2 nodes and increasing the number of shards 2,4,6,8,10 and also number of replicas from 0,1, Each iteration runs for 30 minutes.

  • 2 shards, 0 replicas
  • 2 shards, 1 replica
  • 4 shards, 0 replica
  • and so on...


This is usage_iowait (assuming in percent) during the same interval. IOWait is definately increasing while throughput is not.

3 + 6 Nodes
Below is first the throughput for 3 nodes and later the throughput for 6 nodes.


And here is the iowait for the same time interval. Not really sure how to interpret this. There are some spikes at the end but that is probably another problem...


But remember, I just wanted to rule out rally as beeing the bottleneck :wink:
We are aware of that it is not recommended to run on SAN.

regards /Johan

Another thing that is shared across the cluster is typically networking, and I have seen that be the limiting factor in some indexing heavy use cases. What type of networking do you have in place?

Hi Johan,

One option to stress Rally is to "simulate" Elasticsearch's _bulk endpoint with a static response, e.g. by using nginx.

You need to install the "more-include headers" module and then you can do something like this in your nginx config:

server {
        listen 19200 default_server;
        listen [::]:19200 default_server;

        default_type  application/json;

        root /var/www/html;

        index index.json

        server_name _;

        location / {
                # First attempt to serve request as file, then
                # as directory, then fall back to displaying a 404.
                try_files $uri $uri/ =404;

        location /_bulk {
            if ($request_method = POST ) {
                more_set_headers "Content-Type: application/json; charset=UTF-8";
                  return 200 '{"took":514,"errors":false,"items":[{"index":{"_index":"myindex","_type":"mytype","_id":"1","_version":1,"_shards":{"total":1,"successful":1,"failed":0},"created":true,"status":201}}]}';

However, Rally is doing a few more operations (cluster health check, create index etc.) and your simplest option is probably to use nginx as a reverse proxy for one of your Elasticsearch nodes for all operations but _bulk (but I don't have a config snippet ready for this).

The second thing that you could try, is to run Rally with --enable-driver-profiling(see docs) but the better option for you is probably to stub out the _bulk endpoint.


1 Like

BUT... since I get approx 80.000 docs from one rally client, this is enough proof that rally works properly. The only thing I was thinking about was if there was some kind of limiting factor in running them in distributed mode or if there where some kind of default rate limit thingy.

I will try to start 3 rally nodes in local mode and measure total throughput and see if I get the same total throughput and if all clients report equal throughput.

Does this make sense?

Regards /Johan


No, Rally does not rate limit anything in distributed mode. It works as follows: We have one component that is coordinating the clients. This component is always running on the same machine where you invoke the Rally binary. If you choose to distribute the load test driver, only the processes that actually generate load (the clients) are run on the remote machines. Rally will assign clients in a round-robin fashion to the load driver hosts.

By "local mode" you mean that you start all three of them "on the same machine" or do you mean that you once run in distributed mode with three load driver hosts and once run the same benchmark (independently) from three machines (without them knowing about each other)?

One problem that I ran into with multi-node benchmarks was network throughput. Our nightly benchmarks run against up to three Elasticsearch nodes and initial tests showed that with the "PMC" track we would saturate the 1GBit network link (I measured this with ifstat). That's the reason our nightly benchmarks run now in a switched 10GBit network since we do multi-node benchmarks.


1 Like

A little update on this...

I started playing with client count and that seems to do a big difference. When using distributed load drivers, is this the total number of clients or is this the number of clients on each machine? In the log on each machine it seems like it has this number of LoadDrivers but I am not sure.

Regards /Johan

Ok, some results from todays tests. We have doubled the throughput!

Running 3 rally clients (different machines), 2 ES loadbalancer nodes, 4 datanodes.
As I said earlier, I am using eventdata track and append-no-conflicts challenge.

Rally reports a throughput of > 200.000 docs/sec which is awesome

  • 3 rally client machines
  • rally clients count (in track.json) is currently 132 (been increasing and seeing better and better throughput).
  • Using 12 shards (3 on each machine).
  • 2 loadbalancer (coordinating) nodes
  • 4 datanodes

Will continue to add nodes and see how it scales.

The track specification is independent whether the load driver runs on one or multiple hosts. In other words: It's the total number of clients.

Glad that you managed to scale it. Although I have to admit 132 clients indexing at maximum speed concurrently is quite a high number. :slight_smile:

Thanks! The client count is probably not optimal... will dig into that :wink:
We are happy with our tests, would not have been able to do it without rally.

One thing though, the reported throughput with distributed load-drivers sometimes reported a higher value than actual throughput. This is especially true when cluster is under high load. I can get back with some samples later.
I have not looked into how this value is calculated.

Thanks for your positive feedback on Rally. That's much appreciated. :slight_smile:

That is an interesting data point. How did you compare that? Rally's throughput calculation is based on the service time samples. It collects these samples from all clients, puts them into one second buckets and derives the throughput from that. If you are interested, you can check the respective code.

It got a bit more complex recently as Rally is now streaming the data but I verified that the calculation stays absolute identical to before by running both calculation approaches in parallel.

I could imagine that the numbers are a bit off due to different calculation approaches (between Rally and whatever you took as your baseline) though. Anyway, that's maybe worth having a closer look so I'd be very interested in any updates you have.


Hi @JohanRask,

I spent some time over the last days to improve indexing throughput of the eventdata track. In our test environment (4 cores, 32GB RAM, 10GBit network) I have seen the indexing throughput improve with the challenge append-no-conflicts from 62.000 docs/s (median) to 260.000 docs/s (median) in a stress test, i.e. more than 4 times.

So you should need a lot less clients now to achieve the same throughput. However, each client will precalculate some data and this will increase its memory usage. In my experiments each client needs roughly 200 MB of main memory now so you should have enough memory head room to not introduce a bottleneck accidentally.

Should you rerun your experiments I'd be curious on your experiences.


Nice work!
No plans, but it is very likely that I will run them again fairly soon. Let me get back to you then.


This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.