Uneven disk usage after removing a node set

I'm using ECK 1.8.0 and ES 7.16.2.

To increase the storage available to an ES cluster managed by the ECK operator, I proceeded in two steps: first I added a new node set with the disk capacity I need and waited for the shards to relocate, then I removed the old node set and waited for the shards to relocate again.
I did that for the data node set (composed of 3 nodes across 3 zones) as well as for the master node set (also 3 nodes across 3 zones).
I did both at the same time because I wanted to rename the master node set, so I took the opportunity.
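For reference, the change amounts to adding a new nodeSet entry alongside the old one, waiting for the data to move, then deleting the old entry so ECK drains and removes those pods. The fragment below is illustrative (names and sizes are reconstructed from the node names in the output further down, not my exact manifest):

```yaml
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: elastic-site-search
spec:
  version: 7.16.2
  nodeSets:
    # New node set with the larger disks. Once its shards are in place,
    # the old node set entry is removed from this list and ECK tears it down.
    - name: data140g-zonea
      count: 1
      volumeClaimTemplates:
        - metadata:
            name: elasticsearch-data
          spec:
            resources:
              requests:
                storage: 140Gi
    # ...same for data140g-zoneb and data140g-zonec
```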

However, after the shards relocated, one data node's disk usage is way out of balance even though the large shards are well distributed across the nodes. Here is the status.
You can see that the node with the fewest shards is the one with the disk usage anomaly. Note that there is a single large index, and its 6 shards are evenly balanced across the nodes (2 on each node):

shards disk.indices disk.used disk.avail disk.total disk.percent host         ip           node
    70       77.3gb    77.5gb     50.2gb    127.8gb           60 10.208.63. 10.208.63. elastic-site-search-es-data140g-zoneb-0
    69       77.3gb    77.6gb     50.2gb    127.8gb           60 10.208.36.  10.208.36.  elastic-site-search-es-data140g-zonec-0
     4       76.2gb   113.5gb     14.3gb    127.8gb           88 10.208.59.  10.208.59.  elastic-site-search-es-data140g-zonea-0
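A quick arithmetic check on the table above (plain Python, not an Elasticsearch command) makes the anomaly concrete: on the two healthy nodes disk.used is almost exactly disk.indices, but on zonea-0 roughly 37gb of disk.used is not accounted for by shard data:

```python
# Columns copied from the _cat/allocation output above, in GB:
# (disk.indices, disk.used, disk.total)
rows = {
    "data140g-zoneb-0": (77.3, 77.5, 127.8),
    "data140g-zonec-0": (77.3, 77.6, 127.8),
    "data140g-zonea-0": (76.2, 113.5, 127.8),
}
for node, (indices, used, total) in rows.items():
    # disk.used minus disk.indices = space consumed by something other than shards
    gap = used - indices
    print(f"{node}: non-index usage = {gap:.1f}gb, disk.percent = {int(used / total * 100)}")
```

The ~37gb gap is the space the question is about: it is on disk but not attributed to any index.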

Also, I can see this warning in the master node logs:

{"type": "server", "timestamp": "2022-10-20T11:49:47,792Z", "level": "WARN", "component": "o.e.c.r.a.d.DiskThresholdDecider", "cluster.name": "elastic-site-search", "node.name": "elastic-site-search-es-master-zonea-0", "message": "after allocating [[members][2], node[LjEUE5PgTiqB65tygsCQAQ], [R], s[STARTED], a[id=b6KSlUKPSVaWSclfM5IelQ]] node [bCZMIAM6RkS6pqH_MEukjg] would have more than the allowed 10% free disk threshold (9.7% free), preventing allocation", "cluster.uuid": "ykOg819RTCKt73Ehvjylxg", "node.id": "ydGRqsigTbScRiwWZk9ztw"  }
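The "allowed 10% free disk threshold" in that warning comes from the high disk watermark (`cluster.routing.allocation.disk.watermark.high`, 90% used by default): the master refuses to place a shard on a node if doing so would leave it with less than 10% free. A minimal sketch of that decision (simplified; the real DiskThresholdDecider also handles byte-valued watermarks and in-flight relocations):

```python
# Default for cluster.routing.allocation.disk.watermark.high (percent used).
HIGH_WATERMARK_USED = 90.0

def can_allocate(used_percent_after_shard: float) -> bool:
    """Allow a shard on a node only if the node stays below the high watermark."""
    return used_percent_after_shard < HIGH_WATERMARK_USED

# The node in the warning would end up at 9.7% free, i.e. 90.3% used,
# so the replica of [members][2] cannot be placed there.
print(can_allocate(100 - 9.7))  # False
```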

What is taking up that space on the over-used node? What can I do to unblock the situation?


I decided to scale the data node set from one node per zone to two nodes per zone to "unblock" the cluster, hoping it gets fixed when I scale down again later.
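For anyone hitting the same thing, the scale-out is just a count bump on the data node sets (illustrative fragment, one zone shown; the same change applies to the other zones):

```yaml
  nodeSets:
    - name: data140g-zonea
      count: 2   # was 1
```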
