Shard recovery in 8.10.2 seems to happen more often

I have noticed that in version 8.10.2, shard allocation seems to prioritize storage space first and the shard count delta second.
Which is fine. I don't really have a preference either way.
But as a result, shard movement seems to happen more often now.
By default, the shard count delta between data nodes is larger in version 8.10.2,
so I constantly see recovery tasks running.
This was not the case with version 7.15. The older version is stricter about maintaining the shard count difference, so I could set "cluster.routing.allocation.balance.threshold" to, say, 3 and the delta would stay within 3 between data nodes. I rarely saw shard recoveries (unless I deleted indices).

But with version 8.10.2, I always see recovery. Is this expected?
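
For anyone following along, one way to watch these is the cat recovery API; active_only limits the output to recoveries that are currently running:

GET _cat/recovery?v=true&active_only=true&h=index,shard,source_node,target_node,stage,bytes_percent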

Which version were you running before?

Did you change any of the default settings regarding rebalance?

  • cluster.routing.allocation.cluster_concurrent_rebalance
  • cluster.routing.allocation.node_concurrent_incoming_recoveries
  • cluster.routing.allocation.node_concurrent_outgoing_recoveries
  • cluster.routing.allocation.node_concurrent_recoveries
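
If you're not sure, a quick way to check what is currently set (and what the defaults are) is:

GET _cluster/settings?include_defaults=true&flat_settings=true

Anything listed under persistent or transient for cluster.routing.allocation.* has been changed from its default.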

7.15
For 7.15, we changed the following:

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.balance.threshold": 3,
    "cluster.routing.allocation.cluster_concurrent_rebalance": null,
    "cluster.routing.allocation.node_concurrent_recoveries": null,
    "cluster.routing.allocation.node_concurrent_incoming_recoveries":null,
    "cluster.routing.allocation.node_concurrent_outgoing_recoveries":null
  }
}

So the shard delta between data nodes typically stayed <= 3.

Version 8.10.2 seems to be in a perpetual cycle of shard rebalancing.
After setting "cluster.routing.allocation.cluster_concurrent_rebalance" to 30, the number of rebalancing jobs jumped from 2 to 30 immediately.
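
(For reference, that change was just a standard cluster settings update:)

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.cluster_concurrent_rebalance": 30
  }
}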

I'll keep it at 30 for now to see whether there are simply too many pending moves for the default of 2 to catch up with...

But fundamentally it seems to be an issue of too many shard movements; the cluster never settles into a stable state.
We do create/delete some indices hourly and daily, but our experience with version 7.15 suggests this usage pattern was not a problem for the older algorithm.

You should use the default; there is an open issue about this, and increasing this value can negatively affect rebalancing.

Check this issue: Increasing `cluster.routing.allocation.cluster_concurrent_rebalance` causes redundant shard movements · Issue #87279 · elastic/elasticsearch · GitHub

And also this one tracking a possible fix: Throttle recoveries on data nodes instead of master · Issue #98087 · elastic/elasticsearch · GitHub

There was also a change in version 8.6 that altered how balancing works.

I had some issues when I upgraded to 8.8.1 with shards that kept moving around, as mentioned in this topic.


Ok. Thanks.

I think part of what I'm seeing comes from the hourly index creation/deletion in our use case interacting with the new algorithm.

GET _internal/desired_balance
suddenly shows 10 nodes with "node_is_desired": false on the hour.

I guess the newly introduced balancing, based primarily on storage, has made the shard count unimportant, yet it still triggers rebalancing.
It becomes too jittery.

After resetting all cluster settings to their defaults, I just got 40 "node_is_desired": false entries.

This typically means you should increase cluster.routing.allocation.balance.threshold. Looks like you currently have it set at 3, but you can reasonably go much higher than that.
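
For example (the exact value is something to experiment with; 10 here is just an illustration):

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.balance.threshold": 10
  }
}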

ok. I'll give that a try.
Just to mention that when I set it to 3, the delta never settles below 3. I attributed that to 8.10.2 prioritizing storage.
Even with the delta above the threshold, I do see the cluster settle for periods of time, meaning no rebalancing is occurring, which I interpreted to mean the threshold is no longer useful.

I'll give it a larger value to see if it's less jittery.

Yeah, a threshold of 3 no longer means "shard counts balanced within ±3"; the allocator now balances a mix of shard count, disk space and write load. That means one valid way to balance the cluster would be to put lots of tiny shards on one node and the few remaining enormous shards on the other node.
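
If you want to see how that mix is weighted on your version, the balance weight factors are exposed as cluster settings (on recent 8.x these include cluster.routing.allocation.balance.shard, .index, .disk_usage and .write_load, alongside .threshold; worth verifying against the docs for your exact release):

GET _cluster/settings?include_defaults=true&filter_path=defaults.cluster.routing.allocation.balance.*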

thanks for the clarification.
Does this mean the default of 1000 shards per node is no longer enforced?

And setting that value to 10 seems to settle the 8.10.2 cluster. Much less rebalancing.

No, the average number of shards per (non-frozen) node must remain below 1000 by default.
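
That cap comes from the cluster.max_shards_per_node setting (default 1000): shard creation is rejected once the cluster-wide total would exceed 1000 × the number of non-frozen data nodes, so it is a cluster-wide limit rather than a per-node placement rule. A quick way to see where a cluster stands (standard health API fields):

GET _cluster/health?filter_path=status,number_of_data_nodes,active_shards,relocating_shards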

Hi @linkerc, what value did you set for cluster.routing.allocation.balance.threshold, and is it still doing shard movements? Also, is it keeping your cluster in an unbalanced state?