When you intentionally restart nodes, such as during a rolling restart while upgrading Elasticsearch, you should generally disable shard allocation so that the rest of the cluster does not react to the loss of the node while it shuts down and starts back up.
PUT /_cluster/settings
{
  "transient" : {
    "cluster.routing.allocation.enable" : "none"
  }
}
The example above sets the value to "none" transiently, which is appropriate for a rolling restart: transient settings do not survive a full cluster restart. If you need to perform a full cluster restart, use "persistent" in place of "transient".
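For instance, the equivalent setting applied persistently, so that it survives a full cluster restart, would look like this:

PUT /_cluster/settings
{
  "persistent" : {
    "cluster.routing.allocation.enable" : "none"
  }
}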
Setting "cluster.routing.allocation.enable" to "none" prevents any shards, primary or replica, from being allocated anywhere in the cluster, including the shards of any newly created indices. You use this setting when restarting a node to keep automatic recovery from reassigning its primary shards and from creating new copies of the replicas that are temporarily missing. As long as the node comes back relatively quickly, performing that recovery would be a waste of time, memory, and bandwidth.
Now, once you have finished restarting the node, it should rejoin the cluster and recover any primary shards that were missing from the cluster. However, it will not recover any replica shards until you re-enable allocation.
PUT /_cluster/settings
{
  "transient" : {
    "cluster.routing.allocation.enable" : "all"
  }
}
Once you have re-enabled allocation, the replicas should begin to initialize and become active within the cluster.
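You can follow the state of the cluster throughout this process with the cluster health API, which produced the responses shown below (the pretty parameter is optional and simply formats the JSON response for reading):

GET /_cluster/health?pretty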
For example, consider a two-node cluster where the second node to be restarted holds 4 primary shards that do not have any replicas; all other shards have one replica each. After disabling allocation persistently for a full cluster restart, I restarted the first node and checked the cluster's health. Note that it is red because the cluster knows that 4 primaries are missing:
{
  "cluster_name" : "elasticsearch",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 31,
  "active_shards" : 31,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 31
}
Next, I restarted the second node (the one containing the 4 primary shards without replicas) and checked the cluster's health again. Note that it is yellow because there are unassigned replicas, but all primaries are now active:
{
  "cluster_name" : "elasticsearch",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 2,
  "number_of_data_nodes" : 2,
  "active_primary_shards" : 35,
  "active_shards" : 35,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 27
}
Finally, now that the full cluster is back online, you can re-enable allocation ("all"), and all replicas should initialize as well. Note that it is green because all replicas are now active:
{
  "cluster_name" : "elasticsearch",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 2,
  "number_of_data_nodes" : 2,
  "active_primary_shards" : 35,
  "active_shards" : 62,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0
}
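Since allocation was disabled persistently in this example, re-enabling it must also be done through the persistent settings rather than the transient ones:

PUT /_cluster/settings
{
  "persistent" : {
    "cluster.routing.allocation.enable" : "all"
  }
}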