How do I set up my cluster to maximize CPU utilization while testing with Rally

Hi,

I have set up a cluster with the objective of load testing Elasticsearch nodes using Rally. As of now I have a benchmark coordinator, 2 load drivers and 3 Elasticsearch nodes. I wish to maximize the load I can place upon these 3 nodes, and I am looking for recommendations on how to set up the cluster for peak throughput and CPU utilization.

Akshay

Hello,

This question is a bit open-ended without a definition of the load (ingest? search?), other parameters like architecture, mappings, etc., and an understanding of what you want to achieve by benchmarking.

Please take a look at the Using Rally to Get Your Elasticsearch Size Right webinar and the Benchmarking Elasticsearch with Rally presentation linked in the docs page to get you started.

Dimitris

Hi Dimitrios,

I am running performance benchmarks using Rally, with my cluster on AWS m5a.4xlarge instances. I wish to apply load from the default Rally tracks (currently using geopoint) on my Elasticsearch nodes to maximize CPU utilization and throughput (docs/s). This exercise is simply to test Elasticsearch on this AWS instance type to see how well it performs.

The setup consists of the following:
1 benchmark coordinator
2 load drivers
1 to 4 Elasticsearch nodes
Aim: to maximize CPU utilization and throughput as I scale up from one node to four.
Problem: unable to increase the CPU utilization of the Elasticsearch nodes.

Elasticsearch performance and throughput are often limited not by CPU but rather by disk I/O. What type and size of EBS storage do you have on your nodes? Have you monitored disk utilization and iowait while running the benchmark to see whether this might be limiting performance?
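In case it is useful, a minimal sketch of how disk utilization and iowait can be watched on each data node while the race runs (this assumes the sysstat package is installed, and the nvme1n1 device name is only an example; adjust it to whatever backs your data path):

# Extended per-device statistics plus CPU breakdown (including %iowait), refreshed every 5 seconds
iostat -xm 5
# Or limited to the volume backing the Elasticsearch data path (device name is an assumption)
iostat -xm nvme1n1 5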

What type of instance are you using for Rally?

All systems are m5a.4xlarge, including the Rally hosts: 16 cores, 64 GB RAM, with 60,000 PIOPS. iowait doesn't go higher than ~1% and CPU utilization is roughly at 70%.

What indexing rate are you seeing? Which track are you using? How many clients is Rally using?

For nodes that powerful, a lot of the default tracks are quite small. You may want to try the rally-eventdata-track, which generates data on the fly. It is more CPU-intensive but can be run for long periods of time at reasonably high concurrency without resetting.
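A run against it could look roughly like the sketch below. This assumes the eventdata track repository has been registered as a track repository named eventdata in Rally's configuration (see the rally-eventdata-track README), and the challenge name and hosts are only illustrative placeholders:

# Sketch only: repository registration, challenge name and hosts are assumptions
esrally --track-repository=eventdata --track=eventdata --challenge=elasticlogs-1bn-load \
  --target-hosts=<es-node>:9200 --pipeline=benchmark-only \
  --load-driver-hosts=<driver-1>,<driver-2>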

esrally --track=geopoint --challenge=append-fast-with-conflicts --pipeline=benchmark-only \
  --distribution-version=6.5.2 \
  --load-driver-hosts=172.31.46.229,172.31.39.65,172.31.42.222 \
  --target-hosts=172.31.44.111:9200 \
  --track-params="bulk_indexing_clients:8,refresh_interval:1s,number_of_shards:1,number_of_replicas:0"

| Metric | Task | Value | Unit |
|---|---|---|---|
| Cumulative indexing time of primary shards | | 74.4257 | min |
| Min cumulative indexing time across primary shards | | 0 | min |
| Median cumulative indexing time across primary shards | | 0.000216667 | min |
| Max cumulative indexing time across primary shards | | 74.2677 | min |
| Cumulative indexing throttle time of primary shards | | 0 | min |
| Min cumulative indexing throttle time across primary shards | | 0 | min |
| Median cumulative indexing throttle time across primary shards | | 0 | min |
| Max cumulative indexing throttle time across primary shards | | 0 | min |
| Cumulative merge time of primary shards | | 9.73118 | min |
| Cumulative merge count of primary shards | | 45 | |
| Min cumulative merge time across primary shards | | 0 | min |
| Median cumulative merge time across primary shards | | 0 | min |
| Max cumulative merge time across primary shards | | 9.60467 | min |
| Cumulative merge throttle time of primary shards | | 2.23517 | min |
| Min cumulative merge throttle time across primary shards | | 0 | min |
| Median cumulative merge throttle time across primary shards | | 0 | min |
| Max cumulative merge throttle time across primary shards | | 2.23517 | min |
| Cumulative refresh time of primary shards | | 3.6826 | min |
| Cumulative refresh count of primary shards | | 371 | |
| Min cumulative refresh time across primary shards | | 0 | min |
| Median cumulative refresh time across primary shards | | 0.00101667 | min |
| Max cumulative refresh time across primary shards | | 3.61537 | min |
| Cumulative flush time of primary shards | | 8.16592 | min |
| Cumulative flush count of primary shards | | 27 | |
| Min cumulative flush time across primary shards | | 0 | min |
| Median cumulative flush time across primary shards | | 0.000766667 | min |
| Max cumulative flush time across primary shards | | 8.16158 | min |
| Total Young Gen GC | | 87.926 | s |
| Total Old Gen GC | | 0.171 | s |
| Store size | | 3.25239 | GB |
| Translog size | | 0.725253 | GB |
| Heap used for segments | | 12.1751 | MB |
| Heap used for doc values | | 0.0459023 | MB |
| Heap used for terms | | 10.4095 | MB |
| Heap used for norms | | 0 | MB |
| Heap used for points | | 0.826014 | MB |
| Heap used for stored fields | | 0.893669 | MB |
| Segment count | | 57 | |
| Min Throughput | index-update | 72477.7 | docs/s |
| Median Throughput | index-update | 85770.4 | docs/s |
| Max Throughput | index-update | 96098.6 | docs/s |
| 50th percentile latency | index-update | 408.759 | ms |
| 90th percentile latency | index-update | 624.843 | ms |
| 99th percentile latency | index-update | 1304.19 | ms |
| 99.9th percentile latency | index-update | 18652.4 | ms |
| 100th percentile latency | index-update | 34321.7 | ms |
| 50th percentile service time | index-update | 408.623 | ms |
| 90th percentile service time | index-update | 624.682 | ms |
| 99th percentile service time | index-update | 1303.06 | ms |
| 99.9th percentile service time | index-update | 18652.4 | ms |
| 100th percentile service time | index-update | 34321.7 | ms |
| error rate | index-update | 0 | % |

Have you tried increasing the number of bulk indexing clients? 8 sounds a bit low.

I've tried 100, but it doesn't seem to make much of a difference (outside the regular variability).

As @Christian_Dahlqvist mentioned, typically it's I/O that takes a bigger hit than CPU for ingest and search.

With ingestion, once you are certain that the load driver isn't the bottleneck, apart from tweaking the number of clients you can experiment with increasing the bulk_size (which for the geopoint track you are currently using is 5000 by default). Higher throughput (and load) can also be achieved by increasing the number of shards; we recently looked at a 3-node setup with another track where, while trying to explain the modest indexing throughput, we were able to get higher results by increasing the number of primary shards.
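Using the command you posted as a base, that could look roughly like the following sketch (the numbers are purely illustrative, and all track parameters are passed in a single --track-params list so that none of them override the others):

esrally --track=geopoint --challenge=append-fast-with-conflicts --pipeline=benchmark-only \
  --distribution-version=6.5.2 \
  --load-driver-hosts=172.31.46.229,172.31.39.65,172.31.42.222 \
  --target-hosts=172.31.44.111:9200 \
  --track-params="bulk_indexing_clients:16,bulk_size:10000,number_of_shards:3,number_of_replicas:0"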

Note that the geopoint track you are using has a rather small corpus; among the default tracks with pre-generated data, you could try the nyc_taxis track, which has a large corpus and uses a bigger bulk size by default.
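For example (again only a sketch, reusing the same setup and letting the track run its default challenge):

esrally --track=nyc_taxis --pipeline=benchmark-only --distribution-version=6.5.2 \
  --load-driver-hosts=172.31.46.229,172.31.39.65,172.31.42.222 \
  --target-hosts=172.31.44.111:9200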

Finally, depending on the use case, the default compression can be changed to best_compression (see the index.codec setting); this is, for example, what the eventdata track uses. Best compression will also increase CPU usage and can be attractive for logging use cases where disk costs need to be kept under control.
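For reference, index.codec is a static setting that has to be applied when the index is created. A minimal sketch outside of Rally is shown below (the index name and host are just examples; when driving the benchmark through Rally you would normally put this into the track's index settings instead):

curl -XPUT "http://<es-node>:9200/logs-best-compression" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index.codec": "best_compression",
    "index.number_of_shards": 3,
    "index.number_of_replicas": 0
  }
}'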

Many of the common pitfalls and topics touched here are discussed in the links I mentioned in my previous comment.

