Hi All,
I was curious whether the guidance for the cluster.routing.allocation.* settings (e.g. cluster.routing.allocation.node_concurrent_recoveries and cluster.routing.allocation.node_initial_primaries_recoveries) is still accurate.
The reason I ask is this phrase from the guidance:
Increasing this setting may cause shard movements to have a performance impact on other activity in your cluster, but may not make shard movements complete noticeably sooner. We do not recommend adjusting this setting from its default of X.
I was recently doing a rolling restart of a large cluster (each hot node has ~1.6k shards), and each node was taking ~1 hour on average to recover. After messing with the settings a bit:
- cluster.routing.allocation.node_concurrent_recoveries: 2->4, then 4->6
- cluster.routing.allocation.node_initial_primaries_recoveries: 4->8, then 8->10
I noticed a fairly "linear" increase in recovery speed, going from ~1h -> ~30m, then ~30m -> ~20m.
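To show what I mean by "linear", here is a quick back-of-the-envelope check in Python. It pairs each node_concurrent_recoveries value with the approximate recovery time I saw. Treat it as a rough sketch: the times are eyeballed, both settings were changed together (so the speedup can't be attributed to one of them alone), and the `speedups` helper is just a name I made up for illustration.

```python
# (node_concurrent_recoveries, approx. recovery time per node in minutes)
# Times are the rough figures quoted above, not precise measurements.
OBSERVATIONS = [(2, 60), (4, 30), (6, 20)]


def speedups(observations):
    """Return (setting ratio, speedup) pairs relative to the first observation."""
    base_setting, base_minutes = observations[0]
    return [
        (setting / base_setting, base_minutes / minutes)
        for setting, minutes in observations
    ]


if __name__ == "__main__":
    for (setting, minutes), (ratio, speedup) in zip(OBSERVATIONS, speedups(OBSERVATIONS)):
        print(f"setting={setting} minutes={minutes} ratio={ratio:.1f}x speedup={speedup:.1f}x")
```

With these numbers the speedup tracks the setting ratio one-to-one (2x the setting gave about 2x the speed, 3x gave about 3x), which is what I'd call linear scaling in recovery speed.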
So, I'm a bit curious, with all of the recent improvements to Elasticsearch, is this guidance still accurate? Does anyone else adjust these settings?
For context, I'm currently on 8.16.2.
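In case anyone wants to try the same experiment, below is a minimal sketch of changing both settings through the cluster update settings API (PUT _cluster/settings). The URL, the lack of authentication, the choice of "persistent" over "transient", and the helper names `build_settings_body` and `apply_settings` are all assumptions for illustration, so adjust them for your own cluster.

```python
import json
import urllib.request

# Assumption: an unsecured cluster on localhost; real clusters usually need TLS and auth.
ES_URL = "http://localhost:9200"


def build_settings_body(concurrent_recoveries, initial_primaries_recoveries):
    """Build the request body for PUT _cluster/settings.

    Passing None for a value serializes to JSON null, which resets that
    setting to its default.
    """
    return {
        "persistent": {
            "cluster.routing.allocation.node_concurrent_recoveries": concurrent_recoveries,
            "cluster.routing.allocation.node_initial_primaries_recoveries": initial_primaries_recoveries,
        }
    }


def apply_settings(body, es_url=ES_URL):
    """Send the settings update to the cluster and return the parsed response."""
    request = urllib.request.Request(
        f"{es_url}/_cluster/settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    # Only prints the body for the first step (2->4 and 4->8); nothing is sent.
    print(json.dumps(build_settings_body(4, 8), indent=2))
```

Calling `apply_settings(build_settings_body(None, None))` afterwards should put both settings back to their defaults once the restart is done.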