Rolling restart, replica allocation, cluster.routing.allocation.enable vs. index.unassigned.node_left.delayed_timeout

Assume an Elasticsearch cluster with 12 data nodes. The index template specifies 3 replicas (1 primary + 3 replica copies), and index.write.wait_for_active_shards is set to 2.

"settings": {
  "number_of_shards": 5,
  "number_of_replicas": 3,
  "index.write.wait_for_active_shards" : 2

The data nodes will store petabytes of data, and I do not want any shards to be relocated when 1 or 2 nodes become unavailable (i.e. accommodate an outage of at most 2 data nodes). We do not have ephemeral servers in the cluster; an outage of a single server is an incident and should be resolved as soon as possible. On the other hand, I cannot guarantee a maximum duration for which a node can be unavailable: it can take 1 hour, 4 hours, a day, or even a week. I do not want to race against the clock when things go wrong :).

Before restarting a node, I configured cluster.routing.allocation.enable = primaries. One node goes down, and for each primary shard it held, a replica copy on another node is promoted to primary. The replica shards that were assigned to the downed node remain unassigned.
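For completeness, I applied the setting through the cluster settings API with a call along these lines (the port matches my local test cluster below):

```shell
# Restrict shard allocation to primaries only before taking the node down.
curl -X PUT -H "Content-Type: application/json" \
  http://localhost:9201/_cluster/settings -d '
{
  "persistent": {
    "cluster.routing.allocation.enable": "primaries"
  }
}'
```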

$ curl -X GET -H "Content-Type: application/json" http://localhost:9201/_cat/shards
test-2022-11 2 p STARTED    1 3.6kb es02
test-2022-11 2 r STARTED    1 3.6kb es01
test-2022-11 2 r UNASSIGNED                    
test-2022-11 1 r STARTED    1 3.6kb es02
test-2022-11 1 p STARTED    1 3.6kb es01
test-2022-11 1 r UNASSIGNED                    
test-2022-11 0 p STARTED    0  208b es02
test-2022-11 0 r STARTED    0  208b es01
test-2022-11 0 r UNASSIGNED   

Clients can still write and read data, great. Next, I bring the node back online. My expectation is that the unassigned replicas get assigned to the restarted node again. The node may be missing some data, which would be replicated over before the shards can serve client requests.

The restarted node joins the cluster, but none of the replica shards get assigned. Of course this behaviour is correct, because I previously told the cluster to allocate only primary shards. I had to set cluster.routing.allocation.enable = all to trigger replica assignment.
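That second step looks something like this; setting the value to null restores the default ("all"):

```shell
# Re-enable allocation of all shard types after the node has rejoined.
curl -X PUT -H "Content-Type: application/json" \
  http://localhost:9201/_cluster/settings -d '
{
  "persistent": {
    "cluster.routing.allocation.enable": null
  }
}'
```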


  1. I would like to avoid having to flip cluster.routing.allocation.enable between values when a node becomes unavailable and again after it recovers. Is it acceptable to set index.unassigned.node_left.delayed_timeout to a large value, like one month or a year? Would that solve the shard allocation problem?
  2. Once the node is restarted and still holds data for some replica shards, will the shard allocation algorithm assign those replica shards back to this node? After setting cluster.routing.allocation.enable = all, my intention is that the unassigned shards go back to the restarted node, not to another available node.
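For question 1, the setting I have in mind would be applied per index (or in the index template), e.g. with a hypothetical 30-day value:

```shell
# Delay allocation of replicas left unassigned by a node departure for 30 days
# (hypothetical value), on the test index used above.
curl -X PUT -H "Content-Type: application/json" \
  http://localhost:9201/test-2022-11/_settings -d '
{
  "index.unassigned.node_left.delayed_timeout": "30d"
}'
```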

Any pointers highly appreciated.