I have been experimenting with ways to reduce the time it takes to restart our 2 Elasticsearch clusters of 36 nodes each, and I found a somewhat unorthodox way to speed things up. Some context:
We are currently running Elasticsearch 5.5.2, with plans to upgrade to 6.x in the fairly near future.
We have multiple clients writing to our Elasticsearch clusters. We have ways to stop most of the writes during a cluster restart, but not all. So after a node reboots, most shards do a fast recovery from disk, but some of them have seen writes and have to be recovered over the network. Of course, the shards that are likely to see writes are the largest shards (30 to 60 GB), and they take a non-trivial amount of time to recover over the network.
We use attribute-aware shard allocation to spread replicas across 4 different rows in our datacenter. So to speed up restarts, and to validate the stability of the cluster in case we lose a row, we restart 3 nodes on the same row at the same time. This works great and reduces overall restart time a bit.
We limit the number of concurrent recoveries (cluster.routing.allocation.node_concurrent_recoveries, indices.recovery.concurrent_streams, etc.).
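For readers who want to see what such a limit looks like in practice, here is a minimal sketch of a request body for `PUT _cluster/settings`. The helper name and the values are purely illustrative, not our production configuration, and `indices.recovery.max_bytes_per_sec` is an extra throttle I am adding for the example; it is not one of the settings listed above.

```python
import json


def recovery_throttle_settings(node_concurrent_recoveries, max_bytes_per_sec):
    """Build a transient cluster-settings body that throttles recoveries.

    Hypothetical helper: the name and arguments are illustrative only.
    """
    return {
        "transient": {
            # Cap on concurrent shard recoveries per node.
            "cluster.routing.allocation.node_concurrent_recoveries": node_concurrent_recoveries,
            # Per-node recovery bandwidth cap (assumption: not mentioned in the post).
            "indices.recovery.max_bytes_per_sec": max_bytes_per_sec,
        }
    }


if __name__ == "__main__":
    # Illustrative values; tune for your own hardware and network.
    print(json.dumps(recovery_throttle_settings(2, "40mb"), indent=2))
```

The printed JSON can be sent as the body of `PUT _cluster/settings`; using `transient` means the values do not survive a full cluster restart.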
During recovery, a number of shards start initializing, but at some point the recovery queue fills up with network-based recoveries, and the fast recoveries from local storage have to wait for those network recoveries to complete before they can proceed. That wait increases the chance that a write will happen before recovery starts, turning what would have been a fast local recovery into a slow network recovery.
Playing around, I realized that if I manually allocate those unassigned shards (with _cluster/reroute), fast recoveries are processed in parallel with network recoveries. This allows me to reduce the recovery time after restarting 3 nodes from more than 1 hour to less than 30 minutes. Not bad! For the first time in 2 years, I have been able to restart a full cluster in a single day!
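To make the idea concrete, here is a rough sketch of how the manual allocation could be scripted. This is not my actual script. It assumes the unassigned shards are read from `GET _cat/shards?format=json`, that only replicas need to be forced (via the `allocate_replica` reroute command), and that you already know which restarted node holds the on-disk copy of each shard. That knowledge is passed in as `node_for_shard`, a hypothetical mapping you would have to build yourself; the function names are also made up for the example.

```python
import json
import urllib.request


def build_reroute_body(cat_shards_rows, node_for_shard):
    """Build a _cluster/reroute body for unassigned replica shards.

    cat_shards_rows: rows as returned by GET _cat/shards?format=json
    node_for_shard:  hypothetical mapping {(index, shard_number): node_name}
                     naming the node that still has the shard data on disk.
    """
    commands = []
    for row in cat_shards_rows:
        # Only force unassigned replicas; leave primaries to Elasticsearch.
        if row.get("state") != "UNASSIGNED" or row.get("prirep") != "r":
            continue
        key = (row["index"], int(row["shard"]))
        node = node_for_shard.get(key)
        if node is None:
            # No known local copy: let the normal allocator handle it.
            continue
        commands.append(
            {"allocate_replica": {"index": key[0], "shard": key[1], "node": node}}
        )
    return {"commands": commands}


def post_reroute(base_url, body):
    """POST the reroute body to the cluster (base_url is an assumption)."""
    req = urllib.request.Request(
        base_url.rstrip("/") + "/_cluster/reroute",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Shards without an entry in `node_for_shard` are skipped on purpose, so anything the script cannot place on a node with local data is still left to the regular allocation logic.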
I have scripted this process, but it still feels strange and somewhat ugly to force shard allocation. Am I missing a setting somewhere? Should Elasticsearch prioritize the recovery of local shards?
It looks like Elasticsearch 6.x and partial recoveries will solve at least some of that problem...