Running ES v7.2.0.
We excluded data from node 'es-dbs-022' using cluster.routing.allocation.exclude._name for unrelated reasons. After some time we brought the node back into the cluster via the same settings API (setting the value to null). Since then the cluster has been rebalancing for a few days (it usually finishes in about 30 minutes), and the rebalance itself is behaving very strangely.
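For reference, the settings changes were roughly the following (shown here as transient settings; the exact flavor shouldn't matter):

PUT _cluster/settings
{ "transient": { "cluster.routing.allocation.exclude._name": "es-dbs-022" } }

and later, to bring the node back:

PUT _cluster/settings
{ "transient": { "cluster.routing.allocation.exclude._name": null } }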
We have 60 data nodes, and one specific index has 60 primary shards (plus 60 replicas). Nodes normally hold 1-3 shards of this index, but es-dbs-022 currently holds ~30 of them. Apart from these, the node has only a few shards belonging to other indices.
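The skew is easy to see via the cat shards API (the index name below is a placeholder):

GET _cat/shards/my-index?v&h=index,shard,prirep,state,node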
This is very abnormal; typically, after a few hours of this rebalancing, the node's disk fills up and all writes get blocked. I tried looking into the logs but nothing of importance shows up.
I've set the TRACE log level for logger.org.elasticsearch.cluster.routing.allocation, and the only useful thing I found there is that, according to BalancedShardsAllocator, the node has a negative weight in about half of the log records. No other node has a negative weight.
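For completeness, the log level was raised via the dynamic logger setting, roughly like this:

PUT _cluster/settings
{ "transient": { "logger.org.elasticsearch.cluster.routing.allocation": "TRACE" } }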
One other potential clue is this 'THROTTLE' log entry (it shows up often; bfcIx0i_Spm_aatNSspIkg is es-dbs-022):
{"log":"{\"type\": \"server\", \"timestamp\": \"2019-07-22T13:16:12,653+0000\", \"level\": \"TRACE\", \"component\": \"o.e.c.r.a.a.BalancedShardsAllocator\", \"cluster.name\": \"es-research-cloud\", \"node.name\": \"es-dbm-001\", \"cluster.uuid\": \"Tnbn6gyVRUWU4p-m--4gIA\", \"node.id\": \"ATaVaYN6QZePAa5s5IMhsQ\", \"message\": \"Couldn't find shard to relocate from node [uDxmj5r8QEi-gvdSwu0HQw] to node [bfcIx0i_Spm_aatNSspIkg] allocation decision [THROTTLE]\" }\n","stream":"stdout","time":"2019-07-22T13:16:12.653302601Z"}
Does anybody have any ideas? Thanks!