Benchmarking ES cluster using larger Rally dataset for multiple parallel indexing

cks · June 6, 2019, 1:09am

The above post allowed me to create 10 indices for the nyc_taxis dataset. However, only the shards of one index were populated in parallel; Rally then moved on to the next index, and so on. Is there a way to populate all 10 indices in parallel?
Environment:

4x ES nodes: 512 GB RAM, 48 vcores, 30 TB flash NAS storage.
1-4x Rally nodes (3x load generators).
Track: nyc_taxis.
Heap: 31 GB.

bulk_indexing_client=40, num_of_shards=40, bulk_size=175000
I got about 792k docs/s for index-append running on 1x Rally node and 4x ES nodes.
CPU utilization was 50-60% on each ES node. Increasing the number of load generators did not seem to increase the docs/s figure. I am trying to max out CPU and storage I/O. Are there any other ways to stress test the ES nodes?
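
For reference, one way the "all 10 indices in parallel" idea could be expressed is with the `parallel` element of a Rally challenge schedule, where each task is a `bulk` operation restricted to one index. The sketch below only generates that schedule entry as JSON. The index names `nyc_taxis_0` to `nyc_taxis_9` and the split of 4 clients per index (40 in total, matching the client count above) are assumptions for illustration, and it assumes each document set in the track's corpus declares a matching `target-index`; check the Rally track reference for the version in use before relying on it.

```python
import json


def parallel_bulk_schedule(index_names, bulk_size, clients_per_index):
    """Build one schedule entry whose bulk tasks run concurrently, one per index."""
    tasks = []
    for name in index_names:
        tasks.append({
            # Task names inside a parallel element must be unique.
            "name": f"index-append-{name}",
            "operation": {
                "operation-type": "bulk",
                "bulk-size": bulk_size,
                # Restrict this task to a single target index.
                "indices": [name],
            },
            "clients": clients_per_index,
        })
    return {"parallel": {"tasks": tasks}}


# Hypothetical index names; the real ones depend on how the track was modified.
index_names = [f"nyc_taxis_{i}" for i in range(10)]
schedule_entry = parallel_bulk_schedule(
    index_names, bulk_size=175000, clients_per_index=4
)
print(json.dumps(schedule_entry, indent=2))
```

The printed JSON would replace the single index-append task in the challenge's `schedule` list.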

Christian_Dahlqvist · June 6, 2019, 6:18am

What is the use-case you are benchmarking for? What type of data will you be indexing? Will you be using time-based indices? If so - do you have a specified retention period? What is the aim of your benchmark?

Given the size of the host I would probably recommend running multiple nodes per host.
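
As a rough illustration of why multiple nodes per host can make sense here, the sketch below estimates how many 31 GB heap nodes fit on one host. It assumes the common guidance of giving at most about half of the RAM to heap (leaving the rest to the filesystem cache) and a hypothetical floor of 8 vcores per node; both numbers are assumptions to tune, not measured limits.

```python
def max_nodes_per_host(ram_gb, vcores, heap_gb=31, heap_fraction=0.5,
                       min_cores_per_node=8):
    """Estimate how many ES nodes a host can run, bounded by memory and CPU."""
    # Memory bound: total heap across nodes stays within heap_fraction of RAM.
    by_memory = int((ram_gb * heap_fraction) // heap_gb)
    # CPU bound: every node keeps at least min_cores_per_node (assumed value).
    by_cpu = vcores // min_cores_per_node
    return max(1, min(by_memory, by_cpu))


# The two host sizes mentioned in this thread.
large_host_nodes = max_nodes_per_host(ram_gb=512, vcores=48)
small_host_nodes = max_nodes_per_host(ram_gb=256, vcores=32)
print(large_host_nodes, small_host_nodes)
```

When several nodes share a host, setting `cluster.routing.allocation.same_shard.host: true` keeps a primary and its replica from being allocated to the same physical machine.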

cks · June 7, 2019, 7:06am

At this time I am looking to benchmark Elasticsearch for high-throughput indexing. The likely use case is Apache logs. The records in the http_logs track seem to be small, hence I was trying the nyc_taxis track. I am not particularly looking at time-based indices now, maybe in the future. Retention is likely 6 months.
The aim is to maximize Elasticsearch performance on the available physical environment. I had set up another 4 ES nodes with less CPU and RAM (32 vcores and 256 GB RAM): index-append throughput came down by 20% compared with 48 vcores and 512 GB RAM. Heap was still 31 GB for both, and all other parameters were the same. Here is what I am trying to do:

Find the optimal configuration for full (bulk) indexing for the cluster.
Find the optimal configuration for searches.
Find the optimal configuration for the cluster with both search and indexing running at the same time.

Christian_Dahlqvist · June 7, 2019, 7:18am

For that type of data you almost always want to use time-based indices, e.g. through rollover together with ILM. This generally means that you are indexing into a few shards per node at most at any time. I would still recommend setting up multiple nodes per host. You could also look at the rally-eventdata-track, which was designed to simulate this use case, and then adapt the challenges to fit your needs.
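
To make the rollover plus ILM suggestion concrete, here is a minimal sketch of a lifecycle policy body that rolls over in the hot phase and deletes indices after roughly six months (the retention mentioned above, taken as 180 days). The rollover thresholds `50gb` and `1d` are placeholder assumptions, not recommendations from this thread, and the policy name in the comment is hypothetical.

```python
import json


def build_ilm_policy(max_size="50gb", max_age="1d", retention_days=180):
    """Build an ILM policy body: roll over when hot, delete after retention."""
    return {
        "policy": {
            "phases": {
                "hot": {
                    "actions": {
                        # Roll over when either threshold is reached.
                        "rollover": {"max_size": max_size, "max_age": max_age}
                    }
                },
                "delete": {
                    # min_age is counted from rollover for rolled-over indices.
                    "min_age": f"{retention_days}d",
                    "actions": {"delete": {}},
                },
            }
        }
    }


policy = build_ilm_policy()
# This body would be sent with: PUT _ilm/policy/apache-logs-policy (hypothetical name)
print(json.dumps(policy, indent=2))
```

Tuning `max_size` and `max_age` controls how many shards are actively written at once, which is the point made above about indexing into only a few shards per node.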

Christian_Dahlqvist · June 7, 2019, 7:35am

Also have a look at the following resources:

https://www.elastic.co/es/webinars/using-rally-to-get-your-elasticsearch-cluster-size-right

https://www.elastic.co/es/elasticon/conf/2016/sf/quantitative-cluster-sizing

system · July 5, 2019, 7:35am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
