Rally consumes all cluster threads and crashes with small clusters


We are using the latest Elastic Rally against an ECE deployment with an 8GB cluster (so 2 assigned cores). This in turn means a low number of threads for bulk and search. We are running esrally geonames against that cluster as part of our sizing and benchmarking phase, and after running for a while Rally suddenly gets many 400 errors, which seem to correlate with all available threads being consumed (thread pool rejections).

How does Rally tune its testing speed against a given cluster? Maybe it has issues scaling down to very small clusters?


Internally, Rally uses the default Elasticsearch Python client to issue bulk requests. By default, Rally issues requests as fast as it can. In your case (geonames), Rally uses a bulk size of 5000 documents and 8 clients. Rally cannot "know" what you want to measure and thus does not have any backoff logic like a normal client would (and sometimes this is handy, see e.g. the blog post Why am I seeing bulk rejections in my Elasticsearch cluster?).
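To illustrate what "backoff logic like a normal client" could look like, here is a minimal sketch in Python. It is not Rally's code and does not use the Elasticsearch client: BulkRejected and send_bulk are hypothetical stand-ins for "the cluster rejected the bulk request" and "send one bulk request", and the retry count and delays are arbitrary assumptions.

```python
import random
import time


class BulkRejected(Exception):
    """Hypothetical stand-in for a rejected bulk request (thread pool queue full)."""


def send_with_backoff(send_bulk, max_retries=5, base_delay=0.1, sleep=time.sleep):
    """Call send_bulk(), retrying with exponential backoff plus jitter on rejection."""
    for attempt in range(max_retries + 1):
        try:
            return send_bulk()
        except BulkRejected:
            if attempt == max_retries:
                raise  # give up after the last allowed retry
            # Double the delay on every retry and add jitter so that
            # concurrent clients do not all retry at the same moment.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay))
```

A client like this slows down when the cluster pushes back, which is exactly why a benchmark driver that wants to measure maximum throughput would not behave this way.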

There is also a mode in Rally where you can define a target throughput and Rally will aim to achieve it (whether it does depends on whether Elasticsearch can sustain that throughput), see also the Rally FAQ. It is primarily meant for benchmarking operations where you're interested in a specific latency (e.g. searches) instead of batch operations (e.g. bulk indexing).
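The idea behind a target throughput can be sketched as follows. This is only an illustration of the general technique, not Rally's scheduler: run_throttled, operation and the injectable clock and sleep parameters are made-up names for this example.

```python
import time


def run_throttled(operation, target_throughput, iterations,
                  clock=time.perf_counter, sleep=time.sleep):
    """Run operation() at no more than target_throughput calls per second."""
    interval = 1.0 / target_throughput  # planned gap between request starts
    start = clock()
    latencies = []
    for i in range(iterations):
        scheduled = start + i * interval
        wait = scheduled - clock()
        if wait > 0:
            sleep(wait)  # ahead of schedule, so hold back
        operation()
        # Measure from the scheduled start, so time spent falling behind
        # the schedule is counted towards the latency.
        latencies.append(clock() - scheduled)
    return latencies
```

If the operation takes longer than the planned interval, the loop never sleeps and the measured latencies keep growing, which is the signal that the target throughput cannot be sustained.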

In your case you are probably interested in finding the breaking point and want to avoid bulk rejections. I suggest two things:

  • You can change the bulk size of the track to e.g. 500 documents with --track-params="bulk_size:500" (see the geonames track README). We do not yet expose the number of indexing clients as a parameter, although it would be possible.
  • Bulk rejections (and any other errors) get recorded by Rally, and if you use a dedicated metrics store you can inspect them in more detail. However, I have the impression that you want to treat a bulk rejection as a fatal error. In that case you could add the parameter --on-error=abort so that Rally treats any HTTP error as fatal and aborts the benchmark immediately.

Thanks Daniel, it seems like we will be testing with a larger cluster for now, but having a configurable number of threads will be awesome!

Hi @Itamar_Syn_Hershko,

You can now override the number of bulk indexing clients with the track parameter bulk_indexing_clients.

Example: Set the number of bulk indexing clients to 2 with --track-params="bulk_indexing_clients:2"

This is a change in Rally's default tracks, which usually get updated automatically from GitHub (unless you are offline), so this should just work out of the box as of now.
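If you drive Rally from a script, several track parameters can be passed together as comma-separated key:value pairs. The small Python helper below is only a sketch for assembling that string; format_track_params is a made-up name, and the rest of the esrally command line (track, target hosts and so on) is deliberately left out.

```python
def format_track_params(params):
    """Render a dict as the comma-separated key:value string for --track-params."""
    return ",".join(f"{key}:{value}" for key, value in params.items())


track_params = format_track_params({"bulk_size": 500, "bulk_indexing_clients": 2})
# Combine with the --on-error=abort flag mentioned earlier in this thread.
flags = [f"--track-params={track_params}", "--on-error=abort"]
print(" ".join(flags))
```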
