What happens when synchronization between the primary and replica shards is lost?

What happens when synchronization between the primary and replica shards is lost in Elasticsearch?

For example, a node is disconnected and a replica becomes unassigned. Indexing continues to be persisted to the primary shard, so the primary and the replica are no longer in sync.

In this scenario does Elasticsearch recreate the replica shard from scratch, or is there a different recovery process?

Is there a maximum time or a specific condition for fixing the delta between the primary and the replica shards? For example, if the node holding the replica has not reconnected to the cluster for more than 10 minutes, is the replica shard marked as stale?

What is the

I read the following article and it says:
If a shard is allocated as replica, the node first copies over missing data from the node that has the primary.

Can someone please share a link to the source code for this?

I do not know the internals, so I cannot point you to the appropriate place in the source code. The blog post you linked to is, however, very old and a lot has changed since. It used to be that the full shard had to be replaced when it was identified as stale, but that changed with the introduction of sequence IDs shortly after the blog post you pointed to was published. Nowadays a shard copy that comes back stale can have only the missing data replayed based on these sequence IDs rather than being replaced completely, which is a lot more efficient. If the amount of changes exceeds some threshold, the entire shard will still be replaced. I use the default values, so I have not tried to tune this.


Basically confirming what Christian said: there are some heuristics that will attempt to resync the replica without a full rebuild first. These heuristics are best-effort things but work in practice in many common cases - for instance, I wouldn't expect a 10-min disconnection to need a full rebuild.

There's not much scope for tuning this behaviour tho, it's all very automatic.

Most of the relevant code is in the o.e.i.recovery package (org.elasticsearch.indices.recovery).


Thanks for sharing the article @Christian_Dahlqvist and thanks for your answer @DavidTurner.

From the article:

For Elasticsearch version 6 and above: If a replica needs to be "brought up to date" we use the last global checkpoint known to that replica and just replay the relevant changes from the primary translog rather than an expensive large file copy. If the primary's translog was "too large" or "too old" to be able to re-play to the replica, then we fall back to the old file-based recovery.

  1. What is too large?
    index.translog.retention.size: defaults to 512mb. If the translog grows past this, we only keep this amount around.

  2. What is too old?
    index.translog.retention.age: defaults to 12h. We don't keep translog files past this age.

What I understand from the article:
If the node that holds the replica shard rejoins the cluster within 12h, and the changes to the shard in the meantime amount to less than 512mb, the replica should be brought up to date by replaying the delta between the primary and the replica from the primary shard's translog instead of copying files.
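To make that reading concrete, here is a toy model of the decision in Python. It is only a sketch of the concept as the article describes it, not Elasticsearch's actual code: `PrimaryHistory`, `plan_recovery`, and the idea of a single "oldest retained sequence number" are names and simplifications invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class PrimaryHistory:
    """Hypothetical view of the operation history the primary still retains."""
    min_retained_seq_no: int  # oldest operation that can still be replayed
    max_seq_no: int           # newest operation indexed on the primary


def plan_recovery(replica_global_checkpoint: int,
                  primary: PrimaryHistory) -> Tuple[str, Optional[int]]:
    """Return (mode, ops_to_replay) for a returning replica.

    The replica has safely applied everything up to its global checkpoint,
    so the first operation it is missing is checkpoint + 1.
    """
    first_missing = replica_global_checkpoint + 1
    if first_missing < primary.min_retained_seq_no:
        # The primary has already discarded part of the needed history
        # ("too large" or "too old"), so fall back to copying files.
        return ("file-based", None)
    # All missing operations are still retained: replay only the delta.
    ops = max(0, primary.max_seq_no - replica_global_checkpoint)
    return ("operation-based", ops)


if __name__ == "__main__":
    history = PrimaryHistory(min_retained_seq_no=50, max_seq_no=150)
    print(plan_recovery(99, history))  # replica only slightly behind
    print(plan_recovery(10, history))  # replica behind the retained history
```

In this model the 12h and 512mb limits only matter through `min_retained_seq_no`: they determine how far back the primary can still replay.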

I'll test the following scenario.
1. Recovery test for an index that has active indexing.

  • Create a cluster with three nodes.
  • Create an index with 1 primary and 1 replica.
  • Increase index.unassigned.node_left.delayed_timeout to 15m.
  • Shut down the node that holds the replica.
  • Write new data to the index (less than 512mb).
  • Start the node that holds the replica shard after 5 minutes.
  • Check the elasticsearch logs and indexing rate to understand if it's copying the files from the primary or replaying the relevant changes from the primary translog.

2. Recovery test for an index that has no active indexing.
... The above steps, but without writing any new data to the index.
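For the checking step, a small Python helper along these lines could be used instead of reading logs. It is a sketch under assumptions: the function names are made up, the input is expected to be the JSON returned by `GET /<index>/_recovery`, and treating "zero bytes copied" as operation-based recovery is my own heuristic rather than an official indicator.

```python
import json


def delayed_timeout_body(timeout="15m"):
    """JSON body for PUT /<index>/_settings that delays replica reallocation."""
    return json.dumps(
        {"settings": {"index.unassigned.node_left.delayed_timeout": timeout}}
    )


def summarize_peer_recoveries(recovery_response):
    """Summarize peer recoveries from a parsed GET /<index>/_recovery response.

    Heuristic (an assumption, not an official rule): if no bytes were copied
    from the primary, the replica was caught up by replaying operations.
    """
    rows = []
    for index_name, body in recovery_response.items():
        for shard in body.get("shards", []):
            # Only replica recoveries from a primary are of interest here.
            if str(shard.get("type", "")).upper() != "PEER":
                continue
            copied = shard["index"]["size"]["recovered_in_bytes"]
            replayed = shard["translog"]["recovered"]
            rows.append({
                "index": index_name,
                "shard": shard.get("id"),
                "mode": "file-based" if copied > 0 else "operation-based",
                "bytes_copied": copied,
                "ops_replayed": replayed,
            })
    return rows


if __name__ == "__main__":
    # Hand-written sample in the shape of the recovery API response.
    sample = {"my-index": {"shards": [{
        "id": 0, "type": "PEER", "stage": "DONE",
        "index": {"size": {"recovered_in_bytes": 0}},
        "translog": {"recovered": 1200},
    }]}}
    print(delayed_timeout_body())
    print(summarize_peer_recoveries(sample))
```

Running both test scenarios and comparing `bytes_copied` and `ops_replayed` for the replica should show which path was taken.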

Note that the blog post I linked to describes the concepts, but it is very old (written ahead of version 6.0) and the implementation has, as far as I know, evolved and been refined since. The settings you refer to do not seem to be available in recent versions, and it is, as David said, now quite automatic. I do not necessarily see the point in testing this, as it is now mature functionality that works very well.
