The es-cluster warm-node rebalance tasks keep going and shard counts on some nodes keep decreasing for a long time

yitiao_feiyu · June 1, 2022, 2:33am

Version: 7.17
The rebalance tasks on the warm nodes of my ES cluster keep running and never stop. The indices on the warm nodes come from the hot nodes under ILM policy control; I didn't update the warm indices directly. I have a snapshot task, but I don't think it is the reason the rebalancing keeps going.
I tried both reducing and increasing cluster_concurrent_rebalance and waiting, but the rebalancing still didn't stop.
The weird thing is that the disk usage and shard counts of warm-20 and warm-15 have kept decreasing for 8 hours; screenshot below:

I enabled trace logging for the allocator and found that when warm-20 is the source node of a move, its index weight is more than 30 (which is not reasonable), but when warm-20 is the target node, its weight is negative or less than 5. By my own calculation, the 30+ weight may be wrong. warm-20 has far fewer shards than the other warm nodes. I think the bad weight value is the cause of the rebalancing issue, so what is the reason behind this? See the logging screenshot below.
Node ID "WvAX8WYTQiGIYdFdosEBjw" is the node named "warm-20".

Code: /elasticsearch-7.17.3-sources.jar!/org/elasticsearch/cluster/routing/allocation/allocator/BalancedShardsAllocator.java:552
private void balanceByWeights() { ...."Balancing from node [{}] weight: [{}] to node [{}] weight: [{}] delta: [{}]",

My cluster settings (keys are truncated as captured):
GET /_cluster/settings
"persistent" : {
"cluster.routing.alloc
"cluster.routing.alloc
"cluster.routing.alloc
"cluster.routing.allocation.enable"
},
"transient" : {
"cluster.routing.alloc
"cluster.routing.alloc
"cluster.routing.alloc
"cluster.routing.alloc
"cluster.routing.alloc
....
Current GET _cat/allocation?v= output (node and shard count; two values are missing in the capture):

node              shards
logging-hot-11    150
logging-hot-12    151
logging-hot-8     151
logging-hot-19    151
logging-hot-15    151
logging-hot-18    151
logging-hot-1     152
logging-hot-9
logging-hot-5     152
logging-hot-17
logging-hot-2     152
logging-hot-0     152
logging-hot-7     152
logging-hot-14    153
logging-hot-3     153
logging-hot-16    153
logging-hot-6     154
logging-hot-10    154
logging-hot-4     154
logging-hot-13    154
logging-warm-20   165
logging-warm-15   205
logging-warm-8    292
logging-warm-2    294
logging-warm-12   305
logging-warm-4    306
logging-warm-22   307
logging-warm-9    310
logging-warm-7    311
logging-warm-23   311
logging-warm-14   312
logging-warm-10   315
logging-warm-18   315
logging-warm-11   315
logging-warm-6    316
logging-warm-16   317
logging-warm-17   317
logging-warm-3    317
logging-warm-19   318
logging-warm-0    318
logging-warm-21   318
logging-warm-5    319
logging-warm-13   320
logging-warm-1    321

(screenshot: warm-disk-usage)

yitiao_feiyu · June 1, 2022, 2:48am

I calculated the index weight for warm-20 myself. It may not be exactly right, but as a reference the weight should be negative, not the 30+ that the screenshot shows.

args:  node_id=None, node=logging-warm-20, index=phpback_2022-05-12
----------------------
formula:   weightIndex = indexBalance * (shards_of_node_index_count - avg_shards_per_node_index) = 0.55 * (2 - 1.043) = 0.526
formula:   weightShard = shardBalance * (node_shards_count - avg_shards_per_node) = 0.45 * (168 - 223.652) = -25.043
    summary  node: logging-warm-20, index: phpback_2022-05-12,  node_index_weight: -24.517
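
The hand calculation above can be reproduced with a short Python sketch. It follows the two formulas exactly as written in this post, not the Elasticsearch source. The function name `node_index_weight` and its parameter names are made up for illustration. Dividing each factor by the sum of both is an assumption about how the allocator normalizes them; with 0.55 and 0.45 the sum is 1.0, so it does not change the result here.

```python
def node_index_weight(node_shards, avg_shards_per_node,
                      node_index_shards, avg_index_shards_per_node,
                      index_balance=0.55, shard_balance=0.45):
    """Return (weight_index, weight_shard, total) for one node/index pair.

    Hypothetical helper mirroring the formulas in the post above.
    """
    # Assumption: the factors are normalized by their sum (1.0 for 0.55 + 0.45).
    total_balance = index_balance + shard_balance
    theta_index = index_balance / total_balance
    theta_shard = shard_balance / total_balance

    # How far this node is above (+) or below (-) the per-index average.
    weight_index = theta_index * (node_index_shards - avg_index_shards_per_node)
    # How far this node is above (+) or below (-) the overall average.
    weight_shard = theta_shard * (node_shards - avg_shards_per_node)
    return weight_index, weight_shard, weight_index + weight_shard


# Values taken from the calculation for logging-warm-20 / phpback_2022-05-12.
w_index, w_shard, w_total = node_index_weight(
    node_shards=168,
    avg_shards_per_node=223.652,
    node_index_shards=2,
    avg_index_shards_per_node=1.043,
)
print(f"weightIndex={w_index:.3f} weightShard={w_shard:.3f} total={w_total:.3f}")
```

With these inputs the total is negative, because the node holds far fewer shards than the average and the shard term dominates the small positive index term.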

DavidTurner · June 1, 2022, 10:28am

Strangely I have just been investigating another case that looks very similar and found some strange effects when too much concurrent balancing is allowed. I opened #87279 with some more details but the short answer is "remove cluster.routing.allocation.cluster_concurrent_rebalance from your config".
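
One way to remove the setting is to send `null` for it through the cluster settings API, which resets a setting to its default. The sketch below only builds and prints the request body. The truncated settings output above does not show whether the value was set as persistent or transient, so resetting it in both places is an assumption, and the helper name `build_reset_body` is made up for illustration.

```python
import json

SETTING = "cluster.routing.allocation.cluster_concurrent_rebalance"


def build_reset_body(setting=SETTING):
    """Build a cluster-settings body that resets `setting` to its default.

    A None value serializes to JSON null, which clears an explicit setting.
    Assumption: the setting may exist in either scope, so both are cleared.
    """
    return {
        "persistent": {setting: None},
        "transient": {setting: None},
    }


body_json = json.dumps(build_reset_body(), indent=2)

# The body would be sent as: PUT /_cluster/settings
print("PUT /_cluster/settings")
print(body_json)
```

Afterwards, GET /_cluster/settings should no longer list the setting in either section.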

system · June 29, 2022, 10:28am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
