One node taking much more space than others

Hi all,

I have a small cluster with 37 nodes (3 master and 34 data). The cluster had an unusually high number of shards per node (~1200), which is above what the Elasticsearch documentation recommends.
Following the recommended guidelines, I changed the number of shards for each index so that each node holds no more than 600 shards.
This setting only applies to new indices (the cluster creates new ones every day and deletes those older than 30 days), so the reduction is still in progress; right now I have ~700 shards per node.
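For reference, the change goes in through the index template that the daily indices are created from; a minimal sketch of that kind of change (the template name, index pattern, and shard count here are just placeholders, and the exact template API depends on the Elasticsearch version):

curl -s -X PUT 'http://localhost:9200/_template/daily-logs' -H 'Content-Type: application/json' -d '
{
  "index_patterns": ["logs-*"],
  "settings": {
    "index.number_of_shards": 2,
    "index.number_of_replicas": 1
  }
}'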
The problem is that one of the nodes is using much more disk space than the rest, even though it holds the same number of shards as the other nodes. Yesterday I took this node out of the cluster, restarted it, and joined it again so that the shards would be rebalanced. That worked for about 24 hours, but today I noticed it only has 300 GB of free disk while the other nodes have around 600 GB.
The outcome is that when this node reaches the disk watermark, the cluster starts to suffer and the queues grow, causing ingestion to fail.
Is there any way to find out why there is such a big difference in disk usage, or why it always happens on this node? Any other advice?

Thanks in advance

I do not think I'd call 34 data nodes * 700 shards per node a "small cluster" :slight_smile:

The first thing I would look for is whether there is a single oversized shard on that node, or whether the shards on that node are just generally larger than elsewhere. The GET _cat/shards API is a good place to look for this:

curl -s 'http://localhost:9200/_cat/shards?bytes=b' | sort -k5 -n | sort -k8 -s

(obviously change the URL to point to one of your actual nodes)

Because you have so many shards it won't be possible to share the full output here, but you can put it on https://gist.github.com.
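If the shards turn out to be evenly sized, it is also worth comparing the per-node totals; something along these lines (again, pointing at one of your own nodes) shows the shard count and disk usage for each node:

curl -s 'http://localhost:9200/_cat/allocation?v'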

Which version are you using?
