We recently upgraded from ES 5.4 to ES 7.7 and are facing a strange issue in production where shard relocation goes into a loop. We have noticed it mainly on large clusters with a hot/cold configuration. For example, one affected cluster has the following configuration:
3 master nodes
2 nodes with node.attr.tag as hot
30 nodes with node.attr.tag as cold (the relocation loop happens on these nodes)
Indices with hot/cold include/exclude allocation filters in their settings
To simplify debugging, we added the following to the cluster settings:
cluster_concurrent_rebalance: 10
node_concurrent_recoveries: 1
balance.index: 0.0f
balance.shards: 1.0f
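For anyone who wants to reproduce this, here is a rough Python sketch of the request body for `PUT _cluster/settings`. It assumes the shorthand names above refer to the `cluster.routing.allocation.*` settings. As far as I know, the documented name of the shard balance factor is singular (`balance.shard`). Using `transient` rather than `persistent`, the logger entry for the trace logs, and the helper name `build_debug_settings` are assumptions made for illustration.

```python
import json


def build_debug_settings():
    """Build a body for PUT _cluster/settings (hypothetical helper).

    The full setting names are an assumed expansion of the shorthand in the post.
    """
    prefix = "cluster.routing.allocation."
    return {
        # Assumption: transient, so the values are dropped on a full cluster restart.
        "transient": {
            prefix + "cluster_concurrent_rebalance": 10,
            prefix + "node_concurrent_recoveries": 1,
            # index factor 0 and shard factor 1: only total shards per node count
            prefix + "balance.index": 0.0,
            prefix + "balance.shard": 1.0,
            # Assumption: logger name taken from the allocator's Java package.
            "logger.org.elasticsearch.cluster.routing.allocation"
            ".allocator.BalancedShardsAllocator": "TRACE",
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_debug_settings(), indent=2))
```

The printed JSON can be sent with any HTTP client. Setting the logger entry back to `null` afterwards should restore the default log level.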
I enabled trace logging for BalancedShardsAllocator and went through the code. The first time, it relocates a shard from maxNode to minNode. After that I see simulate log lines, which increase the number of shards on that minNode in the model only, because the real relocation is throttled. So in the run for the next index, minNode has more shards in the model, and that leads to shards moving away from minNode even though it actually has fewer shards. I am attaching logs for this.
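To make the behaviour described above concrete, here is a minimal Python sketch of how I understand the model to diverge from the real shard counts. It is a toy simplification, not the actual BalancedShardsAllocator code. It assumes the weight reduces to "shards on node minus average shards per node" (index factor 0, shard factor 1), a stop threshold of 1.0, and throttling only on incoming recoveries per target node. The node names, shard counts and the function `balance_pass` are made up for illustration.

```python
def weight(num_shards, avg_shards):
    """Simplified node weight: distance from the average shard count."""
    return num_shards - avg_shards


def balance_pass(actual, node_concurrent_recoveries=1, threshold=1.0):
    """Run one toy balancing pass over a dict of node -> shard count.

    Returns (model, started, simulated): the in-memory model after the pass,
    the moves that really started, and the moves that were throttled but
    still applied to the model.
    """
    model = dict(actual)                   # the balancer's in-memory view
    avg = sum(model.values()) / len(model)
    incoming = {node: 0 for node in model}  # real recoveries per target node
    started, simulated = [], []
    while True:
        min_node = min(model, key=lambda n: weight(model[n], avg))
        max_node = max(model, key=lambda n: weight(model[n], avg))
        delta = weight(model[max_node], avg) - weight(model[min_node], avg)
        if delta <= threshold:
            break
        # The model is updated whether or not the move is throttled.
        model[max_node] -= 1
        model[min_node] += 1
        if incoming[min_node] < node_concurrent_recoveries:
            incoming[min_node] += 1
            started.append((max_node, min_node))    # real relocation
        else:
            simulated.append((max_node, min_node))  # throttled, model only
    return model, started, simulated


if __name__ == "__main__":
    actual = {"cold-1": 10, "cold-2": 10, "cold-3": 4}
    model, started, simulated = balance_pass(actual)
    print("model:", model)
    print("started:", started)
    print("simulated:", simulated)
```

With these made-up numbers the model ends the pass with 8 shards on every node, while only one relocation to cold-3 has really started, so cold-3 will hold 5 shards once that relocation finishes. This is the gap between the model and the real cluster that I am asking about.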
In terms of logic and code, this is the same in ES 5.4, but we never faced the issue there. Could someone please help with how I can debug this, and also explain why relocation is simulated and the shard count on a node is increased in the model?
Ubuntu Pastebin - Node RlZgpH1xTYKuEjn1gyH3CA is the node with the fewest shards when I enable shard rebalancing.