Minimizing / avoiding churn due to relocation and rebalancing during node outages in a cluster

We are running a 3-node Elasticsearch 7.0.1 cluster and are looking for options to handle the single-node outage scenario. Since we are using the default configuration options, when a single node goes down from a healthy 3-node cluster, the shards (primaries and replicas) of the downed node migrate to the remaining two nodes as expected (via the allocation + rebalance mechanisms, after `index.unassigned.node_left.delayed_timeout` expires).
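For context, `index.unassigned.node_left.delayed_timeout` is a dynamic index-level setting (default 1m). A minimal sketch of how it can be changed, where the `10m` value and the `localhost:9200` endpoint are just illustrative assumptions:

```
# Illustration: lengthen the delayed-allocation timeout for all indices so that
# replica re-allocation does not begin immediately after a node leaves.
curl -X PUT "localhost:9200/_all/_settings" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index.unassigned.node_left.delayed_timeout": "10m"
  }
}'
```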

Also, when the downed node rejoins the cluster, the whole rebalancing / relocation cycle gets executed again. I understand that primaries relocating to the remaining nodes is something that is required (without it, writes would not go through for the corresponding shards).

Specific clarifications:

a. What happens to the stale replica copies when the downed node rejoins? I assume they get fully destroyed, and re-allocation and rebalancing result in replicas being created from scratch on the node that rejoined. Please correct me if I am wrong.

b. Are there any config options to avoid or minimize the churn due to relocation/rebalancing of shards when a single node goes down and when it rejoins the cluster? We are trying to minimize or avoid this churn, which can potentially impact the read/write availability of shards during the period of relocation + rebalancing.

Our typical index config + operational params (a configuration sketch follows this list):

  • Replicas: 1 per shard
  • Shards: 5 per index
  • Shards per node: ~700-800
  • Wait-for-active-shards: currently using the default value of 1 (primary only), but considering a value of 2 (primary + replica) to maximize replica consistency
  • Cluster network: local to the datacenter
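For illustration only, a sketch of how an index with this shape could be created - the index name `logs-000001` is a placeholder, and `index.write.wait_for_active_shards: 2` reflects the value we are only considering, not our current setting:

```
# Illustrative sketch: create an index with 5 primary shards and 1 replica each,
# and make writes wait for 2 active copies (primary + replica) by default.
curl -X PUT "localhost:9200/logs-000001" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index.number_of_shards": 5,
    "index.number_of_replicas": 1,
    "index.write.wait_for_active_shards": 2
  }
}'
```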

Some basic constraints of the deployment:

  • Adding additional data nodes to spread the existing shards (and thereby reduce the impact of relocations) is not an option we can choose

Please let me know if any additional details are required

A recovery does not start by destroying anything. If there's data on the recovery target that the recovery can use then it will do so.

Also normally most replicas are not stale and then the behaviour is different again. If a node joins the cluster and it has a good copy of a shard that is currently recovering onto a different node then that recovery will be cancelled in favour of using the existing good copy instead. In 7.0.1 "good" means sync-flushed, which in practice means roughly "hasn't seen any indexing in the 5 minutes before the node left the cluster" but there might be scope to relax this in future.
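As a side note, a synced flush can also be requested explicitly before a planned restart; a minimal sketch, assuming the cluster is reachable on `localhost:9200`:

```
# Illustration: a synced flush writes a sync_id marker to shard copies with no
# pending operations, so identical copies can be reused during recovery.
curl -X POST "localhost:9200/_flush/synced"
```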

The main config option here is the `delayed_timeout` setting you mentioned. A common mistake is also to set the number of concurrent recoveries too high - the cluster tends to recover more reliably if these settings are left at their default values.
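For reference, a sketch of how to check the current and default values of these recovery settings (in 7.x the defaults are 2 concurrent recoveries per node and a 40mb/s recovery rate limit; `localhost:9200` is an assumption):

```
# Illustration: list the recovery-related settings together with their defaults.
curl -s "localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty" \
  | grep -E "node_concurrent_recoveries|recovery.max_bytes_per_sec"
```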

But any recoveries or rebalancing shouldn't make any shards unavailable for either reads or writes, so I don't understand the mechanism by which you say this has an impact on availability. Can you clarify?

Thanks @DavidTurner for the quick reply.

It was my mistake that I was not clear on the availability aspect.

Our objective is to tolerate at most a single node failure, and there is a constraint of a fixed amount of local storage on each node (4 TB per node, which runs as a physical appliance). So, with the indices and shards we have, we configured 1 replica for each shard. This gives us an effective storage capacity of 1.5x a single node's disk (i.e. half of the total cluster disk capacity), with the remaining storage across the nodes used for replicas. With this configuration we are effectively able to tolerate the loss of a single node.

While doing an exercise of spinning up 3 nodes and testing the 'single-node-down' behaviour, we observed the following:

  1. For primary shards on the lost node, the respective replicas on one of the other two nodes get promoted to primary

  2. The replica shards that were on the lost node get rebalanced and assigned to one of the other two nodes

Both (1) and (2) are expected behaviour from an Elasticsearch point of view. We are good with (1), but we are wondering if we can prevent (2) from happening in any way until the lost node is added back to the cluster. There are two reasons we want to avoid this:

a) Given that we only need to tolerate the loss of a single node, rebalancing the lost replicas seems to cause a lot of churn, especially when the indexes are several TB in size

b) In the worst-case scenario, the size of all of our indexes could be 1.5x a single node's disk capacity. When all 3 nodes are up and running, the replicas have enough capacity. But when a single node is down there is not enough capacity, and rebalancing the lost replicas may push the system close to 100% capacity, causing regular operations to come under pressure
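For concreteness, the kind of knob that can suppress (2) is the allocation-enable setting used during rolling restarts. This is only an illustrative sketch of what such a change would look like, not a recommendation from this thread, and while it is in force the lost replicas simply stay unassigned:

```
# Illustration: temporarily restrict allocation to primaries only, as in a
# rolling restart; lost replicas remain UNASSIGNED until this is reset to null.
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": {
    "cluster.routing.allocation.enable": "primaries"
  }
}'
```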

After the delayed allocation timeout, Elasticsearch will start to rebuild lost replicas on the remaining nodes two-at-a-time by default. Is this too much "churn"? Why does this cause a problem?

Elasticsearch will stop allocating replicas to a node once it exceeds the low disk watermark which should keep you away from an out-of-disk-space situation even if you do not have the capacity for all of your data to be fully replicated on the remaining nodes. Its goal here is to replicate as much data as possible in case another failure occurs. I'm not completely following how this puts pressure on regular operations - can you clarify this too?
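For reference, a sketch of the disk-watermark settings involved; the values shown are simply the 7.x defaults (85% / 90% / 95%), included only to make the mechanism explicit:

```
# Illustration: the allocation disk watermarks, set here to their default values.
# low: stop allocating new shards to the node; high: relocate shards away;
# flood_stage: apply a read-only (allow delete) block to indices on the node.
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": "85%",
    "cluster.routing.allocation.disk.watermark.high": "90%",
    "cluster.routing.allocation.disk.watermark.flood_stage": "95%"
  }
}'
```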

Hi @DavidTurner,

Thanks for explaining how rebalancing accounts for the low disk watermark. That makes sense.

I think it would be better to consider a scenario for illustration so that I can convey my concern better:

Consider 30 indices with 5 shards each and replicas=1.
So, we would have 150*2 = 300 shards in total.
Assuming an ideal balance, when all 3 nodes are available with sufficient disk space, each node will get a fair share of 100 shards: 50 primaries and 50 replicas.

If one node goes down, at least 50 lost replicas have to move to the remaining 2 nodes. When there is no constraint on disk space, each of the 2 remaining nodes will get a fair share of 25 replicas, allocated while honouring the other important conditions (e.g. the primary and replica of a shard are not co-located, the low-disk-watermark condition is satisfied, no more than a specific number of shards recover concurrently, etc.)

This implies that

  1. Each node should have disk headroom to accommodate these additional shards - approximately 1.25x, plus some headroom for honouring the other constraints
  2. Since the primary and replica of a shard cannot be co-located, the new replicas on the remaining 2 nodes have to be populated with data from the corresponding primaries, i.e. via network transfer

When the above transition happens, as I understand it (please correct me if I am wrong or missing some finer details I am not aware of):

  • Replicas being rebuilt will be in an UNASSIGNED (and then INITIALIZING) state, and only the primaries will handle all writes and reads.
  • Depending on the size of the shards, the period during which the primaries alone take the read + write load could be prolonged

This is what I meant by the pressure of regular write + read operations being concentrated only on the primaries.
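As an illustration of how this window can be observed (a sketch using the cat APIs; `localhost:9200` and the chosen columns are just examples):

```
# Illustration: shard copies and their states (STARTED / INITIALIZING / UNASSIGNED)
curl -s "localhost:9200/_cat/shards?v&h=index,shard,prirep,state,node,unassigned.reason"

# Illustration: the ongoing peer recoveries rebuilding replicas from primaries
curl -s "localhost:9200/_cat/recovery?v&active_only=true&h=index,shard,type,stage,source_node,target_node,bytes_percent"
```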

I hope my clarification explains this better now.

Thanks, there are a handful of things that I think it will be useful to clarify.

The primary/replica distinction is not particularly important here: if you lose a primary then Elasticsearch will promote the surviving replica to a primary and will do so essentially immediately. It's normally best to think of all copies of a shard as being equal - they're equivalent for reads and (with rare exceptions) equivalent for writes too.

If you have 100 shards on each node and you lose a node then Elasticsearch will aim to rebuild all 100 lost shards on the remaining nodes, ideally 50 on each. Thus you need 1.5x capacity on each node, not 1.25x, to completely tolerate the loss of a node. You can get away with less disk space, as we discussed above, because Elasticsearch will stop allocating shards when a disk gets too full. However the 1.5x capacity requirement is pretty much unavoidable for other things, e.g. searches: if you're running each node at more than 70% of its capacity to handle searches with 3 nodes then your search requirement is 3 * 70% = 210% of a node, and there's no way a 2-node cluster can deliver that. That's just a fact of life in fault-tolerant systems.
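To make the headroom check concrete, per-node shard counts and disk usage can be inspected like this (an illustrative sketch only; `localhost:9200` is an assumption):

```
# Illustration: per-node shard counts and disk usage, to check whether each node
# has roughly the 1.5x headroom discussed above.
curl -s "localhost:9200/_cat/allocation?v&h=node,shards,disk.indices,disk.used,disk.avail,disk.percent"
```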

The write load on a shard should be no different whether there are replicas or not. The read load on a shard, however, is most concentrated just after the node fails, and this concentration will diminish over time as replicas are rebuilt. The faster the replicas are rebuilt, the quicker the read load can be spread out again. But this isn't quite what you seemed to be saying earlier.

I still don't see how the churn itself impacts the availability of shards in your case. Rebuilding as many replicas as fast as possible seems like a wholly good thing from an availability point of view, even though every rebuilt shard will need to be relocated back onto the lost node when it returns.
