Optimizing Large-Scale Elasticsearch Logging Cluster for Resilience and Stability

We operate a logging cluster with:

  • 180 hot nodes (i3.4xlarge, 400TB, 3-month retention)
  • 250 warm nodes (i3en.2xlarge, 700TB, 15-month retention)
  • 24 router nodes (r5.2xlarge)
  • 3 master-eligible nodes (c5.metal)
  • Weekly indices with 1P+1R configuration
  • 25K total shards, 1 trillion documents
  • 150K events/sec ingest (hot nodes), 100K queries/sec (across both hot and warm) on average

We recycle all nodes every 30 days, but EC2 failures cause severe cluster instability due to network issues.

Are there any improvements we could make to our architecture, sharding strategy, or recovery procedures to enhance resilience and stability? Should we consider updating instance types, switching to EBS volumes instead of instance storage, or using HDD-based instances for warm nodes?

Any insights would be helpful. Thanks!

What version are you running?

i3 and i3en nodes are pretty good. What alternatives do you propose?

High IO latency is usually bad for stability, so I'd steer clear of spinning disks entirely. EBS might work if it's well provisioned, but it's still not as stable as local SSDs.

Most of the users I know at this sort of scale find it much cheaper to use searchable snapshots rather than keeping the entire dataset on local disk as you are doing.
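
For reference, a rough sketch of what that could look like via ILM, assuming a snapshot repository named logs-repo is already registered (the policy name, phase timings, and repository name here are illustrative, not a recommendation for your exact retention):

    # Hypothetical ILM policy: roll over weekly-ish on the hot tier, then swap
    # the local copy for a searchable snapshot in the cold phase.
    curl -s -X PUT "http://localhost:9200/_ilm/policy/logs-searchable-snapshot" \
      -H 'Content-Type: application/json' -d '
    {
      "policy": {
        "phases": {
          "hot": {
            "actions": { "rollover": { "max_age": "7d" } }
          },
          "cold": {
            "min_age": "90d",
            "actions": {
              "searchable_snapshot": { "snapshot_repository": "logs-repo" }
            }
          }
        }
      }
    }'

Bear in mind searchable snapshots need an Enterprise licence, but at 700TB of warm data the savings from retiring most of that tier tend to be substantial.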

What exactly do you mean by this? If an EC2 instance fails, usually the node just leaves the cluster and ES rebalances the work elsewhere. What "network issues" are you experiencing? How does the "cluster instability" manifest exactly?

Thanks for the reply, David.
We're running ES v8.8.
What I meant by updating instance types was moving to larger instances (i3en.6xlarge) to reduce the total number of nodes. I believe the high node count may be the bottleneck in our current situation.

We don't yet fully understand what's causing the cascading failures. However, it appears that when a single EC2 instance fails, the remaining nodes start seeing frequent connection errors, which congests the entire cluster. That congestion seems to delay cluster state updates, and ultimately the router nodes start throwing timeouts on both read and write operations.
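
For context, these are the kinds of checks that should show whether state updates really are backing up when this happens (illustrative only, assuming curl access to a coordinating node on localhost:9200):

    # Is the elected master's task queue backing up? A long list here would
    # explain delayed cluster state updates.
    curl -s "http://localhost:9200/_cat/pending_tasks?v"

    # Cluster health plus relocating/initializing shard counts after a node drops.
    curl -s "http://localhost:9200/_cat/health?v"

    # What the busiest nodes are doing while the router nodes time out.
    curl -s "http://localhost:9200/_nodes/hot_threads?threads=5"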

I don't know of any reason why reducing the node count would help with cluster stability. Not that we know everything there is to know about clusters this large, but I think it'd be a bug if this mattered.

This isn't the behaviour I'd expect at all. The cluster should just disconnect and move on. Have you configured net.ipv4.tcp_retries2 correctly? By default Linux is ludicrously slow to notice a network partition.
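
For illustration, a minimal sketch of checking and lowering it on each node (the value 5 is what the reference manual suggests; roll it out however you normally manage sysctls):

    # The Linux default of 15 retransmissions means a dead peer can take
    # roughly 15 minutes to be detected.
    sysctl net.ipv4.tcp_retries2

    # Lower it so broken connections are dropped in seconds rather than minutes,
    # and persist the change across reboots.
    echo "net.ipv4.tcp_retries2 = 5" | sudo tee /etc/sysctl.d/99-es-tcp-retries.conf
    sudo sysctl --system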