Quickly restarting a node

Hi,

To restart a node in our cluster as quickly as possible, I use the following procedure:

  1. I disable shard allocation except for new primaries:
curl -s -u "$username:$password" -X 'PUT' -H 'Content-Type: application/json' -d '{ "transient": { "cluster.routing.allocation.enable": "new_primaries" } }' "$cluster_url/_cluster/settings"
  2. I perform a synced flush:
curl -s -u "$username:$password" -X 'POST' "$cluster_url/_flush/synced"
  3. I restart the node.
  4. When the node has rejoined the cluster (see the health check after this list), I re-enable shard allocation:
curl -s -u "$username:$password" -X 'PUT' -H 'Content-Type: application/json' -d '{ "transient": { "cluster.routing.allocation.enable": null } }' "$cluster_url/_cluster/settings"
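
To know when it is safe to move on to step 4, one option is to wait for the restarted node to show up again before re-enabling allocation; a minimal sketch, assuming a 6-node cluster (the node count and timeout are illustrative):
curl -s -u "$username:$password" "$cluster_url/_cluster/health?wait_for_nodes=6&timeout=120s"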

My understanding is that cluster health should rapidly transition from yellow to green thanks to the synced flush. However, shard recovery hits throttling and is therefore slow as hell.
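
For reference, that throttling presumably comes from the default recovery limits, which can be relaxed temporarily; a sketch (the values here are illustrative, not recommendations):
curl -s -u "$username:$password" -X 'PUT' -H 'Content-Type: application/json' -d '{ "transient": { "cluster.routing.allocation.node_concurrent_recoveries": 4, "indices.recovery.max_bytes_per_sec": "100mb" } }' "$cluster_url/_cluster/settings"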

Did I miss something?

Our cluster currently contains too many shards and we are working toward reducing that number. Will that solve our problem, or are there other factors influencing node restart duration?

Did the response to the synced flush indicate that it was completely successful? Did you stop indexing while the node was offline? If the answer to either question is no then it's possible that the synced flush marker isn't there on every shard (either it wasn't put in place, or it was put there and then removed) and this results in a slower recovery.
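
As an aside, a quick way to spot which indices reported synced-flush failures is to filter the response, for example with jq (a sketch based on the 6.x synced-flush response shape):
curl -s -u "$username:$password" -X 'POST' "$cluster_url/_flush/synced" | jq 'with_entries(select(.key != "_shards" and .value.failed > 0))'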

Which version are you using?

The synced flush response indicates that almost every shard succeeded: only 22 out of 22472 failed (we really do have too many shards). Indexing wasn't stopped during the node restart, but only a small number of shards should have been touched (I estimate at most 642).

With 6 data nodes, roughly 3745 (22472 / 6) shards are unassigned after a node restart. I expect at most 107 (642 / 6) of them to recover slowly and the remaining shards to recover very quickly (as their flush markers shouldn't have changed).
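
For what it's worth, the unassigned count right after a restart can be double-checked from the cat API, roughly like this:
curl -s -u "$username:$password" "$cluster_url/_cat/shards?h=state" | sort | uniq -c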

For a shard that has been touched during the node restart (so its flush marker has changed), is its recovery duration a function of its size?

I forgot to mention that we are using version 6.6.1.

It depends. In some recoveries Elasticsearch has to make a brand-new copy of the shard. It will re-use any segments that it can, but often there aren't many of these. This was the case for all recoveries in versions before 6.0, and is still the case in more recent versions if there have been too many changes (>512MB of translog), if the node has been offline for too long (>12h), or if the new copy is assigned to a different node from the one that holds the previous, stale copy of the shard.
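
For completeness, those two thresholds correspond to the per-index translog retention settings in 6.x (index.translog.retention.size, default 512mb, and index.translog.retention.age, default 12h). If you wanted to widen that window ahead of a planned restart, something like the following would do it, bearing in mind the values here are only illustrative and retaining more translog uses more disk:
curl -s -u "$username:$password" -X 'PUT' -H 'Content-Type: application/json' -d '{ "index.translog.retention.size": "1gb", "index.translog.retention.age": "24h" }' "$cluster_url/_all/_settings"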

Is that different from what you're seeing? Are you seeing shards recover that you weren't expecting to need recovery?

I'm not sure of my interpretation of the /_recovery information here, but I see almost all of our indices there.
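
For reference, the cat equivalent can be easier to read; something along these lines shows only the recoveries still in flight (a sketch, with an illustrative column selection):
curl -s -u "$username:$password" "$cluster_url/_cat/recovery?v&active_only=true&h=index,shard,time,type,stage,source_node,target_node,bytes_percent"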
