Unbalanced Cluster I/O Operations

I have a cluster of 3 master/data nodes plus 1 Kibana client node, running Elasticsearch version 7.17.4.
The cluster is deployed on Nutanix.
I'm using Auditbeat, Packetbeat, Filebeat, and Elastic Agent as sources of data.
Recently the cluster has become unstable and keeps going down. What I noticed from Stack Monitoring is that, right before each crash, one specific node always reaches a high level of I/O operations compared to the rest of the nodes: around 5000/s versus 1000/s or less.
In all my configurations I was careful to specify all 3 data nodes as Elasticsearch output hosts, and from what I know, Beats always split the load across all outputs almost evenly.
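For reference, the relevant part of each Beat's configuration looks something like this (host names here are placeholders, not my actual nodes):

```yaml
# Sketch of the Elasticsearch output used in each Beat's config.
# With several hosts listed, the Beat distributes published batches
# across them rather than sending everything to one node.
output.elasticsearch:
  hosts: ["es-node-1:9200", "es-node-2:9200", "es-node-3:9200"]
```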

Can you see any possible issue or path to track?

I think there might be a slight misunderstanding in how Elasticsearch works (I've seen a similar issue before, so I'll try to explain what is going on).

Elasticsearch stores the data it ingests (documents) in Indices (indexes), but these indices are really backed by shards. That means that the documents ingested will be written to these backing shards. Elasticsearch distributes these shards across the nodes in the cluster to "balance" it. Elasticsearch unfortunately only knows of one (1) way to "balance" a cluster, via number of shards on a node. This means that Elasticsearch will always try to allocate new shards to the node with the least number of shards.
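You can see this shard-count-based balancing directly with the `_cat` APIs: `_cat/allocation` shows the shard count and disk usage per node, and `_cat/shards` shows which shards live on which node:

```
GET _cat/allocation?v
GET _cat/shards?v&s=node
```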

This can cause issues if one of your nodes has a few shards fewer than your other nodes, as Elasticsearch can then schedule a lot of actively written-to indices (and thus shards) onto that same node, causing it to receive a far higher amount of activity than other nodes in the cluster. This is probably the issue you're seeing here. By default, all of the Elastic Beats products write to indices backed by a single shard (and generally one (1) replica). So, what probably happened was that one of the nodes in your cluster had a few shards fewer than the other nodes. When the Beats indices rolled over, they were mainly allocated to this one (1) node. Thus, this one (1) node is now receiving the majority of traffic, and therefore has a far higher load.

There are a few possible "solutions"/workarounds here:

  1. Change the default shard count for the Beats indices from one (1) to the number of nodes in your cluster. This would ensure that all nodes are actively written to, therefore the load will be evenly spread across all nodes.
    • This does have some minor downsides
1. You will have more shards in your cluster (this downside has been substantially reduced over the last few releases, as scalability with many shards has been greatly improved, so if you're on a newer release it shouldn't matter as much).
      2. It requires modifying the default settings of Elastic products (again a very minor downside, but some people might not like/want to modify defaults)
2. Split the data into more manageable indices. I see you mention the use of Beats; another option would be to use the Elastic Agent, which provides the majority of the same functionality as Beats but has some additional features. One of those "features" is that it uses the data stream convention, which inherently breaks the data down into smaller indices, which can then be more evenly distributed.
    • This also has some minor downsides
      1. Like option 1, this does result in more shards in the cluster overall
      2. Elastic Agent doesn't have all the modules/integrations that Beats does, so the functionality isn't 100% equal, meaning depending on your use case this may not be an option.
      3. There is still a chance you could be really unlucky and you could still have multiple high activity indices/shards allocated to the same node, causing a similar issue again. (But you'd probably need to be really unlucky).
  3. Increase the resources on your Elasticsearch nodes to be able to account for/handle being assigned many high active indices. While this might sound wrong, remember that Elasticsearch is intended to be highly available, so if you have a cluster of 3 nodes, you should really be able to handle having one of those nodes go offline and the other 2 nodes being able to take over the load of the node that is offline.
    • This does have one minor downside
1. If you have really unbalanced index activity, you could end up needing a significant amount of resources that might sit idle on the nodes that aren't the active one.
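As a sketch of option 1 for a Beat on 7.x: the default shard count can be raised through the Beat's template settings (the values below are illustrative; the template is re-applied on setup and takes effect on the next rollover):

```yaml
# filebeat.yml (sketch) – raise the primary shard count so writes
# spread across all data nodes instead of landing on one
setup.template.settings:
  index.number_of_shards: 3
  index.number_of_replicas: 1
```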

For me, when I was seeing this issue, I ended up switching most of the setup to option 2, Elastic Agent, which greatly reduced the number of times a single node in my cluster would be overburdened by many highly active indices.


Thank you very much Ben, your answer is very explanatory.

For the Elastic Agent option: I did mention that I'm already using Elastic Agent, so I'm only using Beats for missing features or special cases.

And for splitting data: even with Beats I use index lifecycle management to roll over small indices.
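For reference, such a rollover policy looks roughly like this (the policy name and thresholds here are just examples):

```
PUT _ilm/policy/beats-rollover-example
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "25gb",
            "max_age": "7d"
          }
        }
      }
    }
  }
}
```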

I think what I still have to try is to increase the number of shards, or maybe it's time to increase the resources; what I have now has been outgrown.

Thank you.

I checked the shard distribution and everything looked fine.
The issue was finally resolved by adding more RAM, and now the cluster is stable.
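For anyone hitting the same symptoms, per-node memory pressure is easy to check before and after resizing:

```
GET _cat/nodes?v&h=name,heap.percent,ram.percent,cpu,load_1m
```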