Optimizing Large-Scale Elasticsearch Logging Cluster for Resilience and Stability

We operate a logging cluster with:

  • 180 hot nodes (i3.4xlarge, 400TB, 3-month retention)
  • 250 warm nodes (i3en.2xlarge, 700TB, 15-month retention)
  • 24 router nodes (r5.2xlarge)
  • 3 master-eligible nodes (c5.metal)
  • Weekly indices with 1P+1R configuration
  • 25K total shards, 1 trillion documents
  • 150K events/sec ingest (hot nodes), 100K queries/sec (across both hot and warm) on average

We recycle all nodes every 30 days, but EC2 failures cause severe cluster instability due to network issues.

Are there any improvements we could make to our architecture, sharding strategy, or recovery procedures to enhance resilience and stability? Should we consider updating instance types, switching to EBS volumes instead of instance storage, or using HDD-based instances for warm nodes?

Any insights would be helpful. Thanks!

What version are you running?

i3 and i3en nodes are pretty good. What alternatives do you propose?

High IO latency is usually bad for stability, so I'd steer clear of spinning disks entirely. EBS might work if it's well provisioned, but it's still not as stable as local SSDs.

Most of the users I know at this sort of scale find it much cheaper to use searchable snapshots rather than keeping the entire dataset on local disk as you are doing.
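
For reference, a rough sketch of what that could look like via ILM, assuming a snapshot repository named logs-repo is already registered (the policy name, phase timings, and repository name here are illustrative, not a recommendation for your exact retention):

    # Hypothetical ILM policy: roll over weekly-ish on the hot tier, then swap
    # the local copy for a searchable snapshot in the cold phase.
    curl -s -X PUT "http://localhost:9200/_ilm/policy/logs-searchable-snapshot" \
      -H 'Content-Type: application/json' -d '
    {
      "policy": {
        "phases": {
          "hot": {
            "actions": { "rollover": { "max_age": "7d" } }
          },
          "cold": {
            "min_age": "90d",
            "actions": {
              "searchable_snapshot": { "snapshot_repository": "logs-repo" }
            }
          }
        }
      }
    }'

Bear in mind searchable snapshots need an Enterprise licence, but at 700TB of warm data the savings from retiring most of that tier tend to be substantial.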

What exactly do you mean by this? If an EC2 instance fails, usually the node just leaves the cluster and ES rebalances the work elsewhere. What "network issues" are you experiencing? How does the "cluster instability" manifest exactly?

Thanks for the reply, David.
We're running ES v8.8.
What I meant by updating instance types was moving to larger instances (i3en.6xlarge) to reduce the total number of nodes. I believe the high node count may be the bottleneck in our current situation.

We don't yet fully understand what's causing the cascading failures. However, it appears that when a single EC2 instance fails, the remaining nodes start seeing frequent connection errors, which congests the entire cluster. That congestion seems to delay cluster state updates, and ultimately the router nodes start throwing timeouts on both read and write operations.
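
For context, these are the kinds of checks that should show whether state updates really are backing up when this happens (illustrative only, assuming curl access to a coordinating node on localhost:9200):

    # Is the elected master's task queue backing up? A long list here would
    # explain delayed cluster state updates.
    curl -s "http://localhost:9200/_cat/pending_tasks?v"

    # Cluster health plus relocating/initializing shard counts after a node drops.
    curl -s "http://localhost:9200/_cat/health?v"

    # What the busiest nodes are doing while the router nodes time out.
    curl -s "http://localhost:9200/_nodes/hot_threads?threads=5"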

I don't know of any reason why reducing the node count would help with cluster stability. Not that we know everything there is to know about clusters this large, but I think it'd be a bug if this mattered.

This isn't the behaviour I'd expect at all. The cluster should just disconnect and move on. Have you configured net.ipv4.tcp_retries2 correctly? By default Linux is ludicrously slow to notice a network partition.
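
For illustration, a minimal sketch of checking and lowering it on each node (the value 5 is what the reference manual suggests; roll it out however you normally manage sysctls):

    # The Linux default of 15 retransmissions means a dead peer can take
    # roughly 15 minutes to be detected.
    sysctl net.ipv4.tcp_retries2

    # Lower it so broken connections are dropped in seconds rather than minutes,
    # and persist the change across reboots.
    echo "net.ipv4.tcp_retries2 = 5" | sudo tee /etc/sysctl.d/99-es-tcp-retries.conf
    sudo sysctl --system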