I know the topic of shard allocation during rolling restarts is a well-worn path here. When we do node restarts, we:
- Disable shard allocation
- Restart the node
- Once the node comes back online, re-enable allocation
- Once the cluster is back to green, proceed to the next node
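
For reference, one iteration of that loop can be sketched in Python roughly as below. This is only an illustration under some assumptions: the cluster address is a placeholder, `restart` is a hypothetical callback standing in for however the node actually gets restarted (and is expected to return only once the node has rejoined), and the function names are made up. The two endpoints used are the cluster settings API (`cluster.routing.allocation.enable`) and the cluster health API with `wait_for_status`.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: placeholder cluster address


def allocation_settings_body(mode):
    """Build the cluster settings body that sets shard allocation to `mode`."""
    return {"transient": {"cluster.routing.allocation.enable": mode}}


def http_send(method, path, body=None):
    """Send one JSON request to the cluster and return the decoded response."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(
        ES_URL + path,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )
    # Note: a timed-out health wait may come back as an HTTP error status,
    # in which case urlopen raises instead of returning a body.
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def restart_node(restart, send=http_send, timeout="60m"):
    """Run one iteration of the restart loop; returns True if the cluster is green."""
    # 1. Disable shard allocation.
    send("PUT", "/_cluster/settings", allocation_settings_body("none"))
    # 2. Restart the node; `restart` is a hypothetical callback that should
    #    return only after the node is back in the cluster.
    restart()
    # 3. Re-enable allocation.
    send("PUT", "/_cluster/settings", allocation_settings_body("all"))
    # 4. Wait for green (or for the timeout to expire) before the next node.
    health = send("GET", "/_cluster/health?wait_for_status=green&timeout=" + timeout)
    return health.get("status") == "green"
```

Passing `send` as a parameter is just a convenience so the sequence of calls can be dry-run against a stub before pointing it at a real cluster.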
We explicitly do not stop indexing, as we have multiple services and user experiences that depend on being able to index data. I have found that if the shards on the node I am restarting have not received any new indexing operations while it was down, the node comes back online almost instantly. If there have been changes, the affected shards have to be reallocated and recovered.
We have two systems: one on 6.8 and one on 5.0 that I am in the middle of upgrading. The 6.x recoveries are actually manageable (I think it mostly comes down to the translog, as there was work in 6.x to optimize this). On 5.0 it is agonizing, as it does a full shard recovery if any indexing has hit the shard. Recovery takes anywhere from 15 to 60 minutes.
I was wondering if there are any strategies that I have not thought of to optimize this process beyond stopping incoming indexing requests (I am pursuing that separately, but I don't have a lot of control over that)?