Optimization for rolling restart without stopping indexing

jthoni · March 5, 2021, 10:30pm

I know the topic of shard allocation during rolling restarts is a well-worn path here. When we do node restarts, we:

1. Disable shard allocation.
2. Restart the node.
3. Once the node comes back online, re-enable allocation.
4. When the cluster is back to green, proceed to the next node.
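
For reference, the steps above can be sketched roughly as follows. This is a minimal sketch, not the exact tooling used here: `ES_URL`, the `restart_node` callback, and the choice of a `transient` setting are assumptions made for illustration. It uses the cluster settings API (`cluster.routing.allocation.enable`) and the cluster health API, and only the Python standard library.

```python
import json
import time
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: address of any node in the cluster


def es_request(method, path, body=None):
    """Send one JSON request to the cluster and return the decoded response."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(
        ES_URL + path,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def allocation_settings(mode):
    """Build the cluster settings body; mode is "none" or "all"."""
    # Assumption: a transient setting is used; "persistent" would also work.
    return {"transient": {"cluster.routing.allocation.enable": mode}}


def restart_one_node(restart_node, request=es_request, poll_seconds=10, sleep=time.sleep):
    """Run the four steps for a single node.

    restart_node is a hypothetical callback that restarts the node and
    returns only once the node has rejoined the cluster.
    """
    # 1. Disable shard allocation.
    request("PUT", "/_cluster/settings", allocation_settings("none"))
    # 2. Restart the node and wait for it to come back online.
    restart_node()
    # 3. Re-enable allocation.
    request("PUT", "/_cluster/settings", allocation_settings("all"))
    # 4. Poll cluster health until it reports green.
    while request("GET", "/_cluster/health")["status"] != "green":
        sleep(poll_seconds)
```

Calling `restart_one_node` once per node, in sequence, gives the rolling restart. The time spent in step 4 is the recovery time discussed below.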

We explicitly do not stop indexing, as we have multiple services and user experiences that depend on being able to index data. I have found that if the node I am restarting has not had any new indexing ops, then the node comes back online almost instantly. If there have been changes, then a shard recovery happens.

We have two systems (one on 6.8 and one on 5.0 that I am in the middle of upgrading). The 6.x recoveries are actually manageable (I think it mostly comes down to the translog, as there was work in 6.x to optimize this). For 5.0 it is agonizing, as it does a full shard recovery if any indexing has hit the shard. This takes from 15 to 60 minutes to recover.

I was wondering if there are any strategies that I have not thought of to optimize this process beyond stopping incoming indexing requests? (I am pursuing that separately, but I don't have much control over it.)

Thanks,
~john

