How do I set up my cluster to maximize CPU utilization while testing with Rally

Hi,

I have set up a cluster with the objective of load testing Elasticsearch nodes using Rally. As of now I have a benchmark coordinator, 2 load drivers and 3 Elasticsearch nodes. I wish to maximize the load I can place upon these 3 nodes, and I am looking for recommendations on how to set up the cluster for peak throughput and CPU utilization.

Akshay

Hello,

This question is a bit open-ended without a definition of the load (ingest? search?), other parameters like architecture, mappings, etc., and an understanding of what you want to achieve by benchmarking.

Please take a look at the Using Rally to Get Your Elasticsearch Size Right webinar and the Benchmarking Elasticsearch with Rally presentation linked in the docs page to get you started.

Dimitris

Hi Dimitrios,

I am running performance benchmarks using Rally, with my cluster on AWS m5a.4xlarge instances. I wish to apply load from the default Rally tracks (currently using geopoint) on my Elasticsearch nodes to maximize CPU utilization and throughput (docs/s). This exercise is simply to test Elasticsearch on this AWS instance type to see how well it performs.

The setup consists of the following:
1 benchmark coordinator
2 load drivers
1 to 4 Elasticsearch nodes
Aim: to maximize CPU utilization and throughput as I scale up from one node to four.
Problem: unable to increase the CPU utilization of the Elasticsearch nodes.

Elasticsearch performance and throughput are often limited not by CPU but rather by disk I/O. What type and size of EBS storage do you have on your nodes? Have you monitored disk utilization and iowait while running the benchmark to see whether this might be limiting performance?
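In case it is useful, a minimal sketch of how disk utilization and iowait can be watched on each data node while the race runs (this assumes the sysstat package is installed, and the nvme1n1 device name is only an example; adjust it to whatever backs your data path):

# Extended per-device statistics plus CPU breakdown (including %iowait), refreshed every 5 seconds
iostat -xm 5
# Or limited to the volume backing the Elasticsearch data path (device name is an assumption)
iostat -xm nvme1n1 5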

What type of instance are you using for Rally?

All systems are m5a.4xlarge, including the Rally hosts: 16 cores, 64 GB RAM, with 60,000 PIOPS. iowait doesn't go higher than ~1% and CPU utilization is roughly at 70%.

What indexing rate are you seeing? Which track are you using? How many clients is Rally using?

For nodes that powerful, a lot of the default tracks are quite small. You may want to try the rally-eventdata-track, which generates data on the fly. It is more CPU-intensive but can be run for long periods of time at reasonably high concurrency without resetting.
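A run against it could look roughly like the sketch below. This assumes the eventdata track repository has been registered as a track repository named eventdata in Rally's configuration (see the rally-eventdata-track README), and the challenge name and hosts are only illustrative placeholders:

# Sketch only: repository registration, challenge name and hosts are assumptions
esrally --track-repository=eventdata --track=eventdata --challenge=elasticlogs-1bn-load \
  --target-hosts=<es-node>:9200 --pipeline=benchmark-only \
  --load-driver-hosts=<driver-1>,<driver-2>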

esrally --track=geopoint --challenge=append-fast-with-conflicts --pipeline=benchmark-only \
  --distribution-version=6.5.2 \
  --load-driver-hosts=172.31.46.229,172.31.39.65,172.31.42.222 \
  --target-hosts=172.31.44.111:9200 \
  --track-params="bulk_indexing_clients:8,refresh_interval:1s,number_of_shards:1,number_of_replicas:0"

| Metric | Task | Value | Unit |
|---|---|---|---|
| Cumulative indexing time of primary shards | | 74.4257 | min |
| Min cumulative indexing time across primary shards | | 0 | min |
| Median cumulative indexing time across primary shards | | 0.000216667 | min |
| Max cumulative indexing time across primary shards | | 74.2677 | min |
| Cumulative indexing throttle time of primary shards | | 0 | min |
| Min cumulative indexing throttle time across primary shards | | 0 | min |
| Median cumulative indexing throttle time across primary shards | | 0 | min |
| Max cumulative indexing throttle time across primary shards | | 0 | min |
| Cumulative merge time of primary shards | | 9.73118 | min |
| Cumulative merge count of primary shards | | 45 | |
| Min cumulative merge time across primary shards | | 0 | min |
| Median cumulative merge time across primary shards | | 0 | min |
| Max cumulative merge time across primary shards | | 9.60467 | min |
| Cumulative merge throttle time of primary shards | | 2.23517 | min |
| Min cumulative merge throttle time across primary shards | | 0 | min |
| Median cumulative merge throttle time across primary shards | | 0 | min |
| Max cumulative merge throttle time across primary shards | | 2.23517 | min |
| Cumulative refresh time of primary shards | | 3.6826 | min |
| Cumulative refresh count of primary shards | | 371 | |
| Min cumulative refresh time across primary shards | | 0 | min |
| Median cumulative refresh time across primary shards | | 0.00101667 | min |
| Max cumulative refresh time across primary shards | | 3.61537 | min |
| Cumulative flush time of primary shards | | 8.16592 | min |
| Cumulative flush count of primary shards | | 27 | |
| Min cumulative flush time across primary shards | | 0 | min |
| Median cumulative flush time across primary shards | | 0.000766667 | min |
| Max cumulative flush time across primary shards | | 8.16158 | min |
| Total Young Gen GC | | 87.926 | s |
| Total Old Gen GC | | 0.171 | s |
| Store size | | 3.25239 | GB |
| Translog size | | 0.725253 | GB |
| Heap used for segments | | 12.1751 | MB |
| Heap used for doc values | | 0.0459023 | MB |
| Heap used for terms | | 10.4095 | MB |
| Heap used for norms | | 0 | MB |
| Heap used for points | | 0.826014 | MB |
| Heap used for stored fields | | 0.893669 | MB |
| Segment count | | 57 | |
| Min Throughput | index-update | 72477.7 | docs/s |
| Median Throughput | index-update | 85770.4 | docs/s |
| Max Throughput | index-update | 96098.6 | docs/s |
| 50th percentile latency | index-update | 408.759 | ms |
| 90th percentile latency | index-update | 624.843 | ms |
| 99th percentile latency | index-update | 1304.19 | ms |
| 99.9th percentile latency | index-update | 18652.4 | ms |
| 100th percentile latency | index-update | 34321.7 | ms |
| 50th percentile service time | index-update | 408.623 | ms |
| 90th percentile service time | index-update | 624.682 | ms |
| 99th percentile service time | index-update | 1303.06 | ms |
| 99.9th percentile service time | index-update | 18652.4 | ms |
| 100th percentile service time | index-update | 34321.7 | ms |
| error rate | index-update | 0 | % |

Have you tried increasing the number of bulk indexing clients? 8 sounds a bit low.

I've tried 100, but it doesn't seem to make much of a difference (outside the regular variability).

As @Christian_Dahlqvist mentioned, typically it's I/O that takes a bigger hit than CPU for ingest and search.

With ingestion, once you are certain that the load driver isn't the bottleneck, apart from tweaking the number of clients you can experiment with increasing the bulk_size (which for the geopoint track you are currently using is 5000 by default). Higher throughput (and load) can also be achieved by increasing the number of shards; we recently looked at a 3-node setup with another track where, while trying to explain the modest indexing throughput, we were able to get higher results by increasing the number of primary shards.
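Using the command you posted as a base, that could look roughly like the following sketch (the numbers are purely illustrative, and all track parameters are passed in a single --track-params list so that none of them override the others):

esrally --track=geopoint --challenge=append-fast-with-conflicts --pipeline=benchmark-only \
  --distribution-version=6.5.2 \
  --load-driver-hosts=172.31.46.229,172.31.39.65,172.31.42.222 \
  --target-hosts=172.31.44.111:9200 \
  --track-params="bulk_indexing_clients:16,bulk_size:10000,number_of_shards:3,number_of_replicas:0"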

Note that the geopoint track you are using has a rather small corpus; among the default tracks with pre-generated data, you could try the nyc_taxis track, which has a large corpus and uses a bigger bulk size by default.
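For example (again only a sketch, reusing the same setup and letting the track run its default challenge):

esrally --track=nyc_taxis --pipeline=benchmark-only --distribution-version=6.5.2 \
  --load-driver-hosts=172.31.46.229,172.31.39.65,172.31.42.222 \
  --target-hosts=172.31.44.111:9200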

Finally, depending on the use case, the default compression can be changed to best_compression (see the index.codec setting); this is, for example, what the eventdata track uses. Best compression will also increase CPU usage and can be attractive for logging use cases where disk costs need to be kept under control.
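For reference, index.codec is a static setting that has to be applied when the index is created. A minimal sketch outside of Rally is shown below (the index name and host are just examples; when driving the benchmark through Rally you would normally put this into the track's index settings instead):

curl -XPUT "http://<es-node>:9200/logs-best-compression" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index.codec": "best_compression",
    "index.number_of_shards": 3,
    "index.number_of_replicas": 0
  }
}'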

Many of the common pitfalls and topics touched here are discussed in the links I mentioned in my previous comment.

