Hi,
I am trying to improve the rolling restart process for our Elasticsearch cluster and am looking for advice. We have a 9-node cluster on version 5.6.3:
- 5 data nodes
- 3 dedicated master nodes
- 1 coordinating node (with Kibana on it)
The issue is that when restarting a data node, the whole cluster goes yellow and about 20% of the shards go unassigned. It then takes 30+ minutes for the cluster to recover. Sometimes a few shards get stuck and I have to run:
POST _cluster/reroute?retry_failed=true
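For reference, while it recovers I watch progress with the standard _cat endpoints (the second one shows the unassigned reason per shard):

GET _cat/health?v
GET _cat/shards?v&h=index,shard,prirep,state,unassigned.reason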
When I run GET _cluster/allocation/explain?pretty right after a restart I see:
...
"unassigned_info" : {
  "reason" : "NODE_LEFT",
  "at" : "2019-12-13T01:57:20.401Z",
  "details" : "node_left[C9PvKKRRQkiQTvb9Oz5iEQ]",
  "last_allocation_status" : "no_attempt"
},
...
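To drill into one specific stuck shard, the explain API also accepts a body; the index name here is just a placeholder:

GET _cluster/allocation/explain
{
  "index": "my-index",
  "shard": 0,
  "primary": true
}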
I am trying to avoid all this shard re-allocation for what is just a reboot, and to speed things up.
I tried following this guide https://www.elastic.co/guide/en/elasticsearch/reference/5.5/rolling-upgrades.html but I am seeing the same behavior.
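For completeness, the per-node sequence I followed from that guide is roughly:

# 1. disable shard allocation
PUT _cluster/settings
{
  "transient": { "cluster.routing.allocation.enable": "none" }
}

# 2. synced flush, so replicas can recover from their local copies
POST _flush/synced

# 3. restart the node and wait for it to rejoin (GET _cat/nodes)

# 4. re-enable shard allocation
PUT _cluster/settings
{
  "transient": { "cluster.routing.allocation.enable": "all" }
}

# 5. wait for the cluster to go green (GET _cat/health)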
I ended up trying the following (exact calls sketched below):
- setting index.unassigned.node_left.delayed_timeout to 10m for all indices
- setting cluster.routing.allocation.enable to "none" before the restart (and back to "all" after)
- setting cluster.routing.allocation.exclude._ip to the node's IP before the restart (and clearing it after)
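The calls for the first and third attempts, for reference (the allocation.enable toggle is the same as in the guide steps above; the IP is a placeholder):

PUT _all/_settings
{
  "settings": { "index.unassigned.node_left.delayed_timeout": "10m" }
}

PUT _cluster/settings
{
  "transient": { "cluster.routing.allocation.exclude._ip": "10.1.2.3" }
}

# cleared again after the restart with
PUT _cluster/settings
{
  "transient": { "cluster.routing.allocation.exclude._ip": "" }
}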
However, no matter what I do, ~20% of the shards still go unassigned. Shouldn't the index.unassigned.node_left.delayed_timeout setting tell the cluster to wait before re-allocating?