Rally consumes all cluster threads and crashes with small clusters

Itamar_Syn_Hershko · January 30, 2018, 10:21am

Heya!

We are using the latest Elastic Rally against an ECE deployment with an 8GB cluster (so 2 assigned cores). This in turn means a low number of threads for bulk and search. We are running esrally geonames against that cluster as part of our sizing and benchmarking phase, and after running for a while Rally suddenly gets many 400 errors, which seem to correlate with all available threads being consumed (thread pool rejections).

How does Rally tune its testing speed against a given cluster? Maybe it has issues scaling down to very small clusters?

Thanks!

danielmitterdorfer · January 30, 2018, 10:46am

Internally, Rally uses the default Elasticsearch Python client to issue bulk requests. By default, Rally will issue requests as fast as it can. In your case (geonames), Rally will use a bulk size of 5000 docs and 8 clients. Rally cannot "know" what you want to measure and thus does not have any backoff logic as a normal client would (and sometimes this is handy, see e.g. the blog post Why am I seeing bulk rejections in my Elasticsearch cluster?).
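To illustrate what "backoff logic" in a normal client could look like, here is a minimal sketch in Python. It is not Rally's code and not the Elasticsearch client's API: `send_bulk` and `BulkRejected` are hypothetical names standing in for whatever call sends a bulk request and whatever error signals a rejection.

```python
import random
import time


class BulkRejected(Exception):
    """Hypothetical error: the cluster rejected a bulk request."""


def send_with_backoff(send_bulk, batch, max_retries=5, base_delay=0.1,
                      sleep=time.sleep):
    """Call send_bulk(batch), retrying with exponential backoff on rejection.

    send_bulk is a caller-supplied function (an assumption of this sketch).
    """
    for attempt in range(max_retries + 1):
        try:
            return send_bulk(batch)
        except BulkRejected:
            if attempt == max_retries:
                # Out of retries: surface the rejection to the caller.
                raise
            # Double the delay on each attempt and add random jitter so
            # that several clients do not retry in lockstep.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay))
```

A client that backs off like this reduces pressure on a small cluster. Rally deliberately does not do this, so rejections show up in the benchmark instead of being hidden.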

There is also a mode in Rally where you can define a target throughput and Rally will aim to achieve it (whether it succeeds depends on whether Elasticsearch can sustain that throughput), see also the Rally FAQ. It is primarily meant for benchmarking operations where you're interested in a specific latency (e.g. searches) rather than batch operations (e.g. bulk indexing).
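The idea behind a target throughput can be sketched as a simple pacing loop: instead of sending the next request as soon as the previous one returns, each request gets a scheduled start time. This is only an illustration of the concept, not Rally's scheduler; `scheduled_start_times` and `run_paced` are hypothetical names.

```python
import time


def scheduled_start_times(target_throughput, num_requests):
    """Return start offsets (seconds) for evenly spaced requests."""
    if target_throughput <= 0:
        raise ValueError("target_throughput must be positive")
    interval = 1.0 / target_throughput
    return [i * interval for i in range(num_requests)]


def run_paced(operation, target_throughput, num_requests,
              clock=time.monotonic, sleep=time.sleep):
    """Run operation() num_requests times at the target rate, if possible."""
    start = clock()
    for offset in scheduled_start_times(target_throughput, num_requests):
        wait = start + offset - clock()
        if wait > 0:
            # We are ahead of schedule: wait until the slot arrives.
            sleep(wait)
        # If wait <= 0 the system is slower than the target rate and the
        # request is issued immediately, already behind schedule.
        operation()
```

With a target of 4 requests per second, the scheduled offsets are 0.0, 0.25 and 0.5 seconds for the first three requests.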

In your case you are probably interested in finding the breaking point and want to avoid bulk rejections. I suggest two things:

You can change the bulk size of the track to e.g. 500 documents with --track-params="bulk_size:500" (see the geonames track README). We do not expose the number of indexing clients as a parameter yet, although it would be possible.
Bulk rejections (and any other errors) get recorded by Rally and if you use a dedicated metrics store you can inspect those in more detail. However, in your case I have the impression that you want to treat a bulk rejection as a fatal error and thus you could add the parameter --on-error=abort so Rally will treat any HTTP error as fatal and abort the benchmark immediately.

Itamar_Syn_Hershko · February 4, 2018, 3:46pm

Thanks Daniel, it seems like we will be testing with a larger cluster for now, but having a configurable number of threads will be awesome!

danielmitterdorfer · February 5, 2018, 8:23am

Hi @Itamar_Syn_Hershko,

you can now override the number of bulk indexing clients with the track parameter bulk_indexing_clients.

Example: Set the number of bulk indexing clients to 2 with --track-params="bulk_indexing_clients:2"

This is a change in Rally's default tracks, which usually get updated automatically from GitHub (unless you are offline), so this should just work out of the box as of now.
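For scripted benchmark runs, the track parameters mentioned in this thread can be combined into one command line. The following is a minimal sketch, assuming that multiple track parameters are passed as a comma-separated list of key:value pairs. The helper names are hypothetical, required options such as the target hosts are omitted, and the exact invocation may differ between Rally versions.

```python
import shlex


def build_track_params(params):
    """Render a dict as a comma-separated key:value list."""
    return ",".join(f"{key}:{value}" for key, value in params.items())


def build_rally_command(track, params, on_error=None):
    """Assemble an esrally argument list (hypothetical helper)."""
    args = [
        "esrally",
        f"--track={track}",
        f"--track-params={build_track_params(params)}",
    ]
    if on_error is not None:
        # --on-error=abort treats any HTTP error as fatal.
        args.append(f"--on-error={on_error}")
    return args


cmd = build_rally_command(
    "geonames",
    {"bulk_size": 500, "bulk_indexing_clients": 2},
    on_error="abort",
)
# Prints the command for inspection; pass cmd to subprocess.run to run it.
print(shlex.join(cmd))
```

Printing the command first makes it easy to check the parameters before starting a long benchmark.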
