I have noticed that in version 8.10.2, shard allocation seems to prioritize storage space first and then the shard count delta.
Which is fine. I don't really have a preference either way.
But as a result, shard movement seems to happen more often now.
By default, the shard count delta between data nodes is larger in version 8.10.2.
So I constantly see recovery tasks running.
This was not the case in version 7.15. The older version is stricter about maintaining the shard count difference, so I could set "cluster.routing.allocation.balance.threshold" to, say, 3 and the delta would stay within 3 between data nodes. I rarely saw shard recovery (unless I deleted indices).
But with version 8.10.2, I always see recovery. Is this expected?
Version 8.10.2 seems to be in a perpetual cycle of shard rebalancing.
After setting "cluster.routing.allocation.cluster_concurrent_rebalance" to 30, the rebalancing jobs jumped from 2 to 30 immediately.
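For reference, the change was a standard cluster settings update along these lines (a sketch; I used a persistent setting, but transient would work too):

```
# illustrative value only: allow up to 30 concurrent shard rebalances
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.cluster_concurrent_rebalance": 30
  }
}
```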
I'll keep it at 30 to see if there were simply too many pending moves for the default of 2 to keep up with...
But fundamentally the issue seems to be too many shard movements. The cluster never settles into a stable state.
We do create/delete some indices hourly and daily, but version 7.15 suggests our usage pattern was not an issue for the older algorithm.
This typically means you should increase cluster.routing.allocation.balance.threshold. It looks like you currently have it set at 3, but you can reasonably go much higher than that.
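For example, something like this (the value 10 is purely illustrative; pick whatever suits your cluster):

```
# illustrative value only: raise the threshold so small imbalances
# no longer trigger shard moves
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.balance.threshold": 10
  }
}
```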
ok. I'll give that a try.
Just to mention that when I set it to 3, the delta never settles below 3. I attributed that to 8.10.2 prioritizing storage.
Even with the delta above the threshold, I do see the cluster settle for periods of time, meaning no rebalancing is occurring, which I interpreted to mean the threshold is no longer useful.
I'll give it a larger value to see if it's less jittery.
Yeah, a threshold of 3 no longer means "shard counts balanced within ±3"; it's now balancing a mix of shard count, disk space, and write load. That means one way to balance the cluster would be to put lots of tiny shards on one node and the few remaining enormous shards on the other node.
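If you want shard counts to dominate again, you can in principle tune the relative weights of those factors. A sketch, assuming the 8.x balance weight settings (check the docs for your exact version; zeroing the disk and write-load weights roughly approximates the old count-based behaviour, but it's generally not recommended to change these):

```
# illustrative only: remove disk usage and write load from the
# balancing formula, leaving shard and index counts
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.balance.disk_usage": 0.0,
    "cluster.routing.allocation.balance.write_load": 0.0
  }
}
```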
Hi @linkerc, what value did you set for cluster.routing.allocation.balance.threshold that still results in shard movements? Also, does it keep your cluster in an unbalanced state?