How to avoid initializing_shards during restarts

Hi folks, we have a 3-node Elasticsearch cluster. When we do a rolling restart we see the following: for shards that were primary on the restarted node A, the replica on node B gets promoted to primary, and then a new replica starts initializing on node C, which can take a very long time. Is there any way to avoid initializing a new replica on node C and instead reuse the old primary copy on node A as the starting point for the new replica?

Hey.

Have a look at Upgrade Elasticsearch | Elastic Docs (step 1):

  1. Disable shard allocation. When you shut down a data node, the allocation process waits for index.unassigned.node_left.delayed_timeout (by default, one minute) before starting to replicate the shards on that node to other nodes in the cluster, which can involve a lot of I/O. Since the node is shortly going to be restarted, this I/O is unnecessary. You can avoid racing the clock by disabling allocation of replicas before shutting down data nodes:
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.enable": "primaries"
  }
}
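
The same docs page has you re-enable allocation once the restarted node has rejoined the cluster, for example:

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.enable": null
  }
}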

@dadoonet yes, I checked it and I am doing that. Right after I revert this change once the node is healthy, the replica shards that existed on the restarted node are re-assigned back to it, which is fine, but initialization still happens for new replicas on the 3rd node. I would prefer Elasticsearch to use the old primary shard from node A as the foundation for the new replica, which should be much faster than allocating a brand-new replica on the 3rd node.

Note: if we do a dirty restart without waiting for green status, the cluster state returns to green quickly and no replica initializations happen on a different node (all replicas stay on their node). But if we wait for green status between restarts, the primary moves to a different node, a new replica initializes, and things take very, very long. My question is: even if that new replica initialization has to happen, can't Elasticsearch use the old primary on the restarted node as the source for the new replica to speed things up? When you have 1000 shards of 30 GB each, initializing replicas on new nodes is a big problem.
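
For context, this is how we observe the replica initializations and where each one copies its data from (just observation, not a fix), using the cat recovery API:

GET _cat/recovery?v&active_only=true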

The recommendation to wait for the cluster to be green before proceeding with the rolling upgrade is to make sure that your indices with replicas have working replicas, to avoid any data loss in case some other issue occurs during the upgrade.

It is not a hard requirement; if the risk of data loss is acceptable to you, you can proceed while the cluster is yellow or even red.
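
As an aside, a minimal sketch of how one might wait for a given health status between node restarts (the status and timeout values here are just illustrative):

GET _cluster/health?wait_for_status=green&timeout=60s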

Besides that, there is not much else you can do; upgrading a cluster can take a long time.

I have an elastic cloud managed cluster that takes something between 8 to 9 hours to finish the upgrade process.


@leandrojmp thank you for the explanation, but there is one thing I would like to understand about the case where the recommendation is followed and we wait for green. Imagine node A had 500 primary shards of 30 GB each, with their replicas on node B. We restart node A, the replicas on node B are promoted to primary, and we see node C start to initialize new replicas for these indices. Is this expected? Even when node A comes back up, will node C still try to initialize those replicas (after we re-enable allocation for all, not just primaries)? 500 shards of 30 GB each can take a very long time. If this is expected and is the trade-off of the recommended scenario, then we would rather risk a temporary red state and not wait for green; in that case the rolling restart/upgrade is much faster. We would like to avoid large data transfers between nodes during restarts.

It’s your data, so your call.

I try not to have a red state, and I certainly don't plan BAU work around hoping it'll only be red temporarily. Client queries, indexing, … are problematic when an index is red. YMMV.

You can avoid nodeC initialising new replica shards if you want; there are allocation settings to do this. But it's a balance: if nodeA is down too long or doesn't come back, you have other problems/risks. You could define more replicas to help mitigate that.

You can avoid nodeC initialising new replica shards if you want

@RainTown can you please explain how? How can we achieve that and still have a green state?

More generally, can we say this: "if we follow the recommended safe upgrade/restart path, it can take a very long time and it will transfer quite a lot of data between hosts (replica initialisation)"? We are looking for this confirmation because, at first glance, moving such large amounts of data between hosts (replica initialising) seems inefficient.

Green is not the important point; just don't be (too) afraid of the yellow state. The docs describe it as follows:

Yellow health status: The cluster has no unassigned primary shards but some unassigned replica shards.

If you shut down a node, then unless it was devoid of shards (which you can arrange, but ...), you are going to have a period where some shards are unassigned, so your cluster will be yellow. But if you use the options described above, e.g. increase index.unassigned.node_left.delayed_timeout, and the node comes back within that time, you will avoid a lot of shard movement and the cluster will be green again fairly quickly. And you can then move on to the next node ...
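
As a sketch of that option, the timeout can be raised on all indices ahead of the maintenance window (5m is just an example; pick a value that comfortably covers a node restart):

PUT _all/_settings
{
  "settings": {
    "index.unassigned.node_left.delayed_timeout": "5m"
  }
}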

But you don't get a completely free pass: all the time your cluster is one node down, you are at higher risk than when at full complement.

@RainTown thank you. So this means that if everything is done correctly, shards are not supposed to be transferred between nodes, even when waiting for green state? Not even replica initialization?

Note: I see my cluster had watermark issues, it was at its limits. That could be what caused those initializations. It is a test cluster; I will free up some space and retry the tests. Thank you

You always want to ensure you have enough free space that your cluster can still cope when only N-1 nodes are available, i.e. running a cluster of say 5 nodes with all nodes at 84% disk usage is just a bad idea, even when it's working fine, as that full data set cannot fit onto 4 nodes.
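
A quick way to see how close each node is to the watermarks is the cat allocation API, which reports per-node shard counts and disk usage:

GET _cat/allocation?v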

As to maintenance, here's a simple sequence. A 3 node, symmetric, cluster:

$ escurl /_cat/shards | fgrep testindex | tr -s ' ' | sort
testindex1 0 p STARTED 16 38.5kb 38.5kb 192.168.122.242 rhel10x2
testindex1 0 r STARTED 16 43.7kb 43.7kb 192.168.122.177 rhel10x3
testindex2 0 p STARTED 19 43.9kb 43.9kb 192.168.122.152 rhel10x1
testindex2 0 r STARTED 19 43.9kb 43.9kb 192.168.122.177 rhel10x3
testindex3 0 p STARTED 24 44.3kb 44.3kb 192.168.122.152 rhel10x1
testindex3 0 r STARTED 24 44.3kb 44.3kb 192.168.122.177 rhel10x3

3 indices, spread over the 3 nodes. Note that rhel10x3 has only 3x replicas, no primary shards. Set index.unassigned.node_left.delayed_timeout to 5m for these indices and reboot rhel10x3.

While it's down, the cluster is yellow and:

$ escurl /_cat/shards | fgrep testindex | tr -s ' ' | sort
testindex1 0 p STARTED 16 38.5kb 38.5kb 192.168.122.242 rhel10x2
testindex1 0 r UNASSIGNED
testindex2 0 p STARTED 19 43.9kb 43.9kb 192.168.122.152 rhel10x1
testindex2 0 r UNASSIGNED
testindex3 0 p STARTED 24 44.3kb 44.3kb 192.168.122.152 rhel10x1
testindex3 0 r UNASSIGNED

as expected, the shards are just UNASSIGNED for the time being.

When rhel10x3 returns:

$ escurl /_cat/shards | fgrep testindex | tr -s ' ' | sort
testindex1 0 p STARTED 16 38.5kb 38.5kb 192.168.122.242 rhel10x2
testindex1 0 r STARTED 16 43.7kb 43.7kb 192.168.122.177 rhel10x3
testindex2 0 p STARTED 19 43.9kb 43.9kb 192.168.122.152 rhel10x1
testindex2 0 r STARTED 19 43.9kb 43.9kb 192.168.122.177 rhel10x3
testindex3 0 p STARTED 24 44.3kb 44.3kb 192.168.122.152 rhel10x1
testindex3 0 r STARTED 24 44.3kb 44.3kb 192.168.122.177 rhel10x3

everything is exactly as, and where, it was before the reboot.

Complications arise when you have many indices: the cluster's effort to maintain balance will impact things. This would be more pronounced if you are already close to limits.
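
If rebalancing churn during a maintenance window is a concern, there is also a separate dynamic setting for it; as a sketch (not a blanket recommendation, and remember to revert it afterwards):

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.rebalance.enable": "none"
  }
}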

Hi,

During a node restart, Elasticsearch promotes replicas to primary and initializes new replicas, so the old primary can’t be reused. To reduce initialization time, do rolling restarts one node at a time, tune node_concurrent_recoveries, max_bytes_per_sec, and shard allocation settings, ensuring faster shard recovery and minimal downtime.
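
For reference, a sketch of the recovery-related settings mentioned above (the values are illustrative only, not recommendations):

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.node_concurrent_recoveries": 4,
    "indices.recovery.max_bytes_per_sec": "100mb"
  }
}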

Welcome to the forum !!

I believe you are being too definitive there. That "can't be reused" should be a "may not be reused", at least AFAIK.

If you (or anyone else) know the specific code paths then I am happy to be corrected (every day is a school day), but let's say my own experience has appeared to be different.

Hi again, one more thing I noticed. I am following the safe approach and waiting for green state. Before the restart each node (1, 2, 3) had an equal number of primaries: we had 1000p and 2000r (1p 2r per index), with primaries distributed evenly (~330 each); each shard is only 7 MB. After the first rolling restart we see that node-1 got all the primaries allocated to it… is this expected or fine?

Er, no, not expected.

Are your counts here real, or just for illustration? They are close to the default shard limit per node (which is 1000), so maybe that is playing a role here?

@RainTown it is the real number; we have set 5000 shards as the node limit.
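
(For anyone following along: the node limit referred to here is presumably cluster.max_shards_per_node; a sketch of how such a limit would typically be raised, using the 5000 value mentioned above:)

PUT _cluster/settings
{
  "persistent": {
    "cluster.max_shards_per_node": 5000
  }
}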