Is the guidance for cluster.routing.allocation.* settings still accurate?

Hi All,

Does anyone know whether the guidance for the cluster.routing.allocation.* settings (e.g. cluster.routing.allocation.node_concurrent_recoveries and cluster.routing.allocation.node_initial_primaries_recoveries) is still accurate?

I ask because of this passage:

Increasing this setting may cause shard movements to have a performance impact on other activity in your cluster, but may not make shard movements complete noticeably sooner. We do not recommend adjusting this setting from its default of X.

I recently did a rolling restart of a large cluster (each hot node holds ~1.6k shards), and each node was taking about 1 hour on average to recover. After adjusting the settings a bit:

  • cluster.routing.allocation.node_concurrent_recoveries: 2 -> 4, then 4 -> 6
  • cluster.routing.allocation.node_initial_primaries_recoveries: 4 -> 8, then 8 -> 10
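For anyone wanting to try the same thing: both are dynamic cluster settings, so they can be changed on a running cluster through the cluster update settings API (PUT _cluster/settings). The sketch below only builds the request bodies and doesn't contact a cluster. It assumes persistent (rather than transient) settings and uses the final values from the list above. Setting a value to null reverts it to its default, which is worth doing once the restart is finished.

```python
import json

# Setting names from the post; values are the last step tried (6 and 10).
RECOVERY_SETTINGS = {
    "cluster.routing.allocation.node_concurrent_recoveries": 6,
    "cluster.routing.allocation.node_initial_primaries_recoveries": 10,
}

def cluster_settings_body(settings, reset=False):
    """Build a JSON body for PUT _cluster/settings.

    With reset=True every value becomes null, which reverts each
    setting to its default.
    """
    values = {name: (None if reset else value) for name, value in settings.items()}
    return json.dumps({"persistent": values}, indent=2, sort_keys=True)

if __name__ == "__main__":
    print("PUT _cluster/settings")
    print(cluster_settings_body(RECOVERY_SETTINGS))
    print("PUT _cluster/settings  (after the restart: back to defaults)")
    print(cluster_settings_body(RECOVERY_SETTINGS, reset=True))
```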

I noticed a fairly "linear" increase in recovery speed: per-node recovery time went from ~1h to ~30m, then from ~30m to ~20m.
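To put a number on "linear": with the approximate figures above, node_concurrent_recoveries × minutes per node stays at about 120 (2 × 60, 4 × 30, 6 × 20). In other words, the speedup tracked the concurrency increase one-for-one. A quick check is below. One caveat: node_initial_primaries_recoveries changed at the same time (4 -> 8 -> 10), so these numbers can't separate the effect of the two settings.

```python
# Approximate figures from the post: (node_concurrent_recoveries, minutes per node).
OBSERVATIONS = [(2, 60), (4, 30), (6, 20)]

def scaling(observations):
    """Return (concurrency ratio, speedup) pairs relative to the first observation."""
    base_conc, base_minutes = observations[0]
    return [(conc / base_conc, base_minutes / minutes) for conc, minutes in observations]

if __name__ == "__main__":
    for (conc, minutes), (ratio, speedup) in zip(OBSERVATIONS, scaling(OBSERVATIONS)):
        print(f"concurrency={conc}: ~{minutes} min/node, "
              f"{ratio:.1f}x concurrency -> {speedup:.1f}x faster, "
              f"concurrency*minutes={conc * minutes}")
```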

So I'm curious: given all the recent improvements to Elasticsearch, is this guidance still accurate? Does anyone else adjust these settings?

For context, I'm currently on 8.16.2.