Weird rebalancing strategy

I am seeing several active recovery tasks where one shard is moving from Node A to Node B while, at the same time, a different shard is moving from Node C to Node A.
Wouldn't this cause rebalancing to never end?

Why would a node be both source & destination in rebalancing?

This is more or less what I am experiencing at the moment. I added 4 new nodes to my cluster, and the shard count on those 4 new nodes fluctuates up and down while their CPU usage is high. Rebalancing is taking a very long time.
Is this normal and expected?
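For reference, the in-flight shard moves described above can be listed with the cat recovery API; a minimal sketch (the column selection here is just one possible choice):

GET _cat/recovery?v&active_only=true&h=index,shard,type,stage,source_node,target_node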

Do you have any non-default settings that could affect the behaviour? Which version of Elasticsearch are you using?

7.3.1
Yes, I have increased the concurrent recovery and rebalance limits as shown below.

PUT _cluster/settings
{
  "persistent":{
    "cluster.routing.allocation.node_concurrent_recoveries":10,
    "cluster.routing.allocation.cluster_concurrent_rebalance":60
  }
}

I make similar changes whenever I add new nodes to speed up rebalancing. This is the first time I've seen behaviour where the new nodes never catch up to the shard count of the others. The shard count on those new nodes just fluctuates up and down, and I noticed one new node being both the source and the destination of moves. That seems like a wasted move.
This is also the first time we have grown the cluster to 25 data nodes. With 21 data nodes or fewer, I never saw this on the same version of ES (7.3.1).
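For completeness, the per-node shard counts I am watching can be checked with the cat allocation API; a minimal sketch:

GET _cat/allocation?v&s=node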

I think this can happen if you adjust those settings; you should revert those changes. The best way to speed up recoveries is with indices.recovery.max_bytes_per_sec. 7.3.1 is also really old, long past EOL; IIRC there were some big improvements to recovery speed in more recent versions, so you should upgrade to a supported version as a matter of urgency too.
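A minimal sketch of that change, resetting the two overridden settings back to their defaults and raising the recovery bandwidth cap (the 100mb value is only illustrative; pick a figure that suits your hardware and network):

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.node_concurrent_recoveries": null,
    "cluster.routing.allocation.cluster_concurrent_rebalance": null,
    "indices.recovery.max_bytes_per_sec": "100mb"
  }
}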

