Elasticsearch rolling restart recovery is slow

I am trying to improve the rolling restart process of our Elasticsearch cluster and am looking for advice. We have an 8-node cluster on version 5.6.3:

  • 5 Data nodes
  • 3 master nodes
  • 1 Coordinator node (with Kibana on it)

The issue is that when I restart a data node, the whole cluster goes yellow and 20% of the shards become unassigned. It then takes 30+ minutes for the cluster to recover. Sometimes a few shards get stuck and I have to run:

When I run _cluster/allocation/explain?pretty right after a restart I see:
"unassigned_info" : {
  "reason" : "NODE_LEFT",
  "at" : "2019-12-13T01:57:20.401Z",
  "details" : "node_left[C9PvKKRRQkiQTvb9Oz5iEQ]",
  "last_allocation_status" : "no_attempt"
}

I am trying to avoid all this shard re-allocation for just a reboot and speed things up.

I tried following this guide https://www.elastic.co/guide/en/elasticsearch/reference/5.5/rolling-upgrades.html but I am seeing the same behavior.

I ended up trying:

  • setting index.unassigned.node_left.delayed_timeout to 10m for all indices
  • setting cluster.routing.allocation.enable to "none" before the restart (and setting it back to "all" afterwards)
  • setting cluster.routing.allocation.exclude._ip: before the restart (and setting it back to blank afterwards)
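
For reference, the first two settings in the list above can be written out as request bodies. This is only a minimal Python sketch: the helper names and the http://localhost:9200 address are made up for illustration, and the exclude._ip setting is left out because the address used is not given here. The bodies follow the shape used in the 5.x reference for PUT _all/_settings and PUT _cluster/settings.

```python
import json
import urllib.request

# Values accepted by cluster.routing.allocation.enable.
VALID_ALLOCATION_MODES = {"all", "primaries", "new_primaries", "none"}


def delayed_timeout_body(timeout: str) -> dict:
    """Body for PUT /_all/_settings: delay reallocation after a node leaves."""
    return {"settings": {"index.unassigned.node_left.delayed_timeout": timeout}}


def allocation_enable_body(mode: str) -> dict:
    """Body for PUT /_cluster/settings: toggle shard allocation cluster-wide."""
    if mode not in VALID_ALLOCATION_MODES:
        raise ValueError(f"unknown allocation mode: {mode!r}")
    return {"transient": {"cluster.routing.allocation.enable": mode}}


def put_json(base_url: str, path: str, body: dict) -> dict:
    """Hypothetical helper: send a JSON body with PUT, return the JSON reply."""
    req = urllib.request.Request(
        base_url.rstrip("/") + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Example sequence around one node restart (address is an assumption):
# ES = "http://localhost:9200"
# put_json(ES, "/_all/_settings", delayed_timeout_body("10m"))
# put_json(ES, "/_cluster/settings", allocation_enable_body("none"))
# ... restart the node and wait for it to rejoin ...
# put_json(ES, "/_cluster/settings", allocation_enable_body("all"))
```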

However, no matter what I do, 20% of the shards go unassigned. Shouldn't the index.unassigned.node_left.delayed_timeout setting tell the cluster to wait before re-allocating?
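
One way to check whether the delay is actually in effect is the delayed_unassigned_shards counter in the _cluster/health response. Shards held back by the timeout are still reported as unassigned, so a yellow cluster on its own does not show that the setting is being ignored. A minimal Python sketch, assuming the health response has already been fetched and decoded into a dict (the function name and the sample numbers are made up):

```python
def split_unassigned(health: dict) -> dict:
    """Split unassigned shards into delayed and not-delayed counts.

    `health` is the decoded JSON from GET /_cluster/health.
    """
    unassigned = int(health.get("unassigned_shards", 0))
    delayed = int(health.get("delayed_unassigned_shards", 0))
    return {"delayed": delayed, "not_delayed": max(unassigned - delayed, 0)}


# Made-up sample, not real cluster output:
sample = {"status": "yellow", "unassigned_shards": 40, "delayed_unassigned_shards": 40}
print(split_unassigned(sample))
```

If every unassigned shard is counted as delayed, the cluster is waiting for the node to come back instead of rebuilding replicas elsewhere.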

Did you follow this step?

  1. Stop non-essential indexing and perform a synced flush (Optional)

You may happily continue indexing during the upgrade. However, shard recovery will be much faster if you temporarily stop non-essential indexing and issue a synced-flush request:
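
In the 5.x reference the request that follows this step is POST _flush/synced. Below is a minimal Python sketch of issuing it and reading the reply. The helper names and the http://localhost:9200 address are assumptions; the "_shards" summary with a "failed" count is the documented response shape. A synced flush is best effort and can fail on shards that are being indexed into at that moment, so it may be worth repeating right before stopping the node.

```python
import urllib.request


def synced_flush_request(base_url: str) -> urllib.request.Request:
    """Build (but do not send) the POST /_flush/synced request."""
    return urllib.request.Request(
        base_url.rstrip("/") + "/_flush/synced",
        data=b"",
        method="POST",
    )


def failed_shard_copies(response: dict) -> int:
    """Number of shard copies the synced flush failed on, per its "_shards" summary."""
    return int(response.get("_shards", {}).get("failed", 0))


# Example (address is an assumption). urlopen raises urllib.error.HTTPError
# on a non-2xx status, so callers may want to catch that and retry:
# import json
# with urllib.request.urlopen(synced_flush_request("http://localhost:9200")) as resp:
#     print(failed_shard_copies(json.load(resp)))
```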

In such an ancient version that's about the best you can do, but there were improvements to recovery speed in 6.0 and further improvements in the 7.x series, so upgrading to a newer version would also help.

Yes, I tried synced-flush before the reboot but it did not make much difference in recovery time. Stopping indexing is not really an option for our use case.

Thanks for the feedback. Just wanted to make sure I tried all the options before having to upgrade.
