Elasticsearch cluster is crashing when conducting heavy aggregations

We currently have 6 ES data nodes, 1 ES master node, and 1 ES coordinating node in our setup. We are running load tests that generate millions of lines of data, which are shipped from hundreds of servers by Filebeat to our 6 Logstash nodes and then on to our ES cluster.

The data is time series: we collect metrics such as response time, response code, latency, etc.

Over the course of 2-3 hours, our index size grows to 200+ GB and the doc count exceeds 200 million.

Each index has 192 shards, and each index represents one test.

Specs for each node:

ES data node - AWS m5.4xlarge (4 cores, 16 GB RAM, 1 TB SSD)
ES coordinating node - AWS r5.2xlarge (8 cores, 64 GB RAM, 1 TB SSD)
ES master node - AWS r5.2xlarge (8 cores, 64 GB RAM, 1 TB SSD)

We are trying to query the data in near real time as it is ingested into ES, but the queries keep getting slower as more data comes in.

My question is: what is the best way to configure our ES cluster, and what are some steps to figure out the correct shard count for each index?

Why do you have so many shards? Have a look at this blog post for some shard size recommendations. If your index size is 200 GB, I would probably recommend using 6 primary shards, not 192.
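As a rough sketch of what that looks like (Python with the requests library; the index name loadtest-000001 and the cluster URL are assumptions, not from your setup):

```python
import requests

ES = "http://localhost:9200"  # assumption: point this at your coordinating node

# Create the index with 6 primary shards and 1 replica instead of 192 shards.
resp = requests.put(
    f"{ES}/loadtest-000001",  # hypothetical index name
    json={
        "settings": {
            "number_of_shards": 6,
            "number_of_replicas": 1,
        }
    },
)
print(resp.json())
```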

You can also use the rollover API to generate indices with a target shard size and flexible time period. This is very useful if you have fluctuating indexing throughput.
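For example, a minimal rollover call could look like this (again Python/requests; the loadtest write alias and the 50 GB / 1 day conditions are illustrative assumptions, not recommendations specific to your data):

```python
import requests

ES = "http://localhost:9200"  # assumption: point this at your coordinating node

# Roll over the index behind the "loadtest" write alias once it reaches
# 50 GB or 1 day of age, whichever comes first. The alias must already
# point at the current index with is_write_index set.
resp = requests.post(
    f"{ES}/loadtest/_rollover",  # "loadtest" is a hypothetical write alias
    json={
        "conditions": {
            "max_size": "50gb",
            "max_age": "1d",
        }
    },
)
print(resp.json())
```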

You should also note that having a single dedicated master node in the cluster is bad; you should ideally have 3. As these should not hold data or serve any requests, they can be quite a lot smaller than the data nodes. I am surprised by the instance types you have chosen, as I would always make the data nodes the most powerful, since they do most of the work.
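One quick way to see which nodes are currently master-eligible and how loaded they are is the _cat/nodes API; a small sketch (Python/requests, cluster URL assumed):

```python
import requests

ES = "http://localhost:9200"  # assumption: point this at your coordinating node

# List every node with its roles and basic load figures. Master-eligible
# nodes show an "m" in the node.role column.
resp = requests.get(
    f"{ES}/_cat/nodes",
    params={"v": "true", "h": "name,node.role,heap.percent,cpu,load_1m"},
)
print(resp.text)
```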

If you are using gp2 EBS, it is worth noting that small volumes tend to have fairly low IOPS, not at all comparable to local SSDs. I would therefore recommend you monitor disk I/O and iowait, as this is often the bottleneck in an Elasticsearch cluster.
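Alongside OS-level tools like iostat, you can sample the nodes stats API for per-node disk counters; a sketch (Python/requests; note the io_stats section is only populated on Linux):

```python
import requests

ES = "http://localhost:9200"  # assumption: point this at your coordinating node

# Pull per-node filesystem stats. On Linux, the io_stats section exposes
# cumulative read/write operation counts you can sample over time to
# estimate IOPS, alongside OS-level tools such as iostat.
resp = requests.get(f"{ES}/_nodes/stats/fs")
for node in resp.json()["nodes"].values():
    io = node["fs"].get("io_stats", {}).get("total", {})
    print(node["name"], io.get("read_operations"), io.get("write_operations"))
```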

Thanks for the pointers. The reason we have so many shards is that we were doing cardinality aggregations that consumed a huge amount of resources, and increasing the shard count seemed to alleviate some of that. We ultimately changed how we do the aggregations but never reverted the shard count. I'm hopeful that changing the shard count, beefing up the data nodes, or upgrading the IOPS on the data nodes will help solve our issue.
