Elasticsearch rolling restart recovery is slow

nemtso · December 13, 2019, 3:03am

Hi,
I am trying to improve the rolling restart process of our Elasticsearch cluster and am looking for advice. We have an 8 node cluster on version 5.6.3:

5 data nodes
3 master nodes
1 coordinator node (with Kibana on it)

The issue is that when I restart a data node, the whole cluster goes yellow and 20% of the shards become unassigned. It then takes 30+ minutes for the cluster to recover. Sometimes a few shards get stuck and I have to run:
_cluster/reroute?retry_failed

When I run _cluster/allocation/explain?pretty right after a restart I see:
...
"unassigned_info" : {
  "reason" : "NODE_LEFT",
  "at" : "2019-12-13T01:57:20.401Z",
  "details" : "node_left[C9PvKKRRQkiQTvb9Oz5iEQ]",
  "last_allocation_status" : "no_attempt"
},
...

I am trying to avoid all this shard reallocation for just a reboot and to speed things up.

I tried following this guide https://www.elastic.co/guide/en/elasticsearch/reference/5.5/rolling-upgrades.html and I am seeing the same behavior.

I ended up trying:

setting index.unassigned.node_left.delayed_timeout to 10m for all indices
setting cluster.routing.allocation.enable to "none" before restart (and setting it back to "all" after)
setting cluster.routing.allocation.exclude._ip: before restart (and setting blank after)

However, no matter what I do, 20% of the shards become unassigned. Shouldn't the index.unassigned.node_left.delayed_timeout setting tell the cluster to wait before reallocating?
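
For reference, the three settings listed above could be applied roughly as in the Python sketch below. It is only an illustration: the ES_URL address, the helper names, and the use of transient cluster settings are assumptions, not details taken from this thread. Note that with delayed allocation the replicas on the stopped node are still reported as unassigned while the node is down; the setting only postpones assigning them to other nodes.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: replace with your cluster address


def delayed_timeout(timeout="10m"):
    """Delay reallocation of replicas after a node leaves, for all indices."""
    body = {"settings": {"index.unassigned.node_left.delayed_timeout": timeout}}
    return "PUT", "/_all/_settings", body


def allocation_enable(mode):
    """Set shard allocation to "none" before the restart and "all" after it."""
    body = {"transient": {"cluster.routing.allocation.enable": mode}}
    return "PUT", "/_cluster/settings", body


def allocation_exclude_ip(ip):
    """Exclude a node by IP (hypothetical value); pass "" to clear it afterwards."""
    body = {"transient": {"cluster.routing.allocation.exclude._ip": ip}}
    return "PUT", "/_cluster/settings", body


def build_request(method, path, body):
    """Turn (method, path, body) into a urllib request without sending it."""
    data = json.dumps(body).encode("utf-8")
    req = urllib.request.Request(ES_URL + path, data=data, method=method)
    req.add_header("Content-Type", "application/json")
    return req


def send(method, path, body):
    """Send the request to the cluster and return the decoded JSON response."""
    with urllib.request.urlopen(build_request(method, path, body)) as resp:
        return json.load(resp)
```

A restart would then call send(*allocation_enable("none")) before stopping the node and send(*allocation_enable("all")) once it has rejoined.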

DavidTurner · December 13, 2019, 7:04am

Did you follow this step?

Stop non-essential indexing and perform a synced flush (Optional)

You may happily continue indexing during the upgrade. However, shard recovery will be much faster if you temporarily stop non-essential indexing and issue a synced-flush request:

In such an ancient version that's about the best you can do, but there were improvements to recovery speed in 6.0 and further improvements in the 7.x series, so upgrading to a newer version would also help.
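
The synced-flush request mentioned in the quoted step is POST _flush/synced. A minimal Python sketch of issuing it, and of waiting for the cluster to return to green after the node rejoins, might look like this. The ES_URL address, the helper names, and the 60s timeout are assumptions chosen for illustration.

```python
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: replace with your cluster address


def synced_flush_request(index=None):
    """Build a synced-flush request for all indices, or for one index if given."""
    path = "/_flush/synced" if index is None else "/{}/_flush/synced".format(index)
    return urllib.request.Request(ES_URL + path, method="POST")


def wait_for_green_request(timeout="60s"):
    """Build a cluster health request that blocks until green or the timeout."""
    path = "/_cluster/health?wait_for_status=green&timeout=" + timeout
    return urllib.request.Request(ES_URL + path, method="GET")


def run(request):
    """Send a prepared request and return the raw response body as text."""
    with urllib.request.urlopen(request) as resp:
        return resp.read().decode("utf-8")
```

Calling run(synced_flush_request()) just before stopping a node, and run(wait_for_green_request()) after re-enabling allocation, mirrors the order of steps in the rolling upgrade guide.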

nemtso · December 13, 2019, 7:56pm

Yes, I tried synced-flush before the reboot but it did not make much difference in recovery time. Stopping indexing is not really an option for our use case.

Thanks for the feedback. Just wanted to make sure I tried all the options before having to upgrade.

system · January 10, 2020, 7:56pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
