Specifics around local store recovery and out-of-sync shards


I'm trying to better understand how local shard data is recovered after a node comes back online (e.g. after a crash). I believe this recovery is called a "local store recovery", as stated in the "Index recovery API" page of the Elasticsearch Guide [7.16].

Just as a quick side-note, I've already read some great documentation around how ES tracks in-sync shard copies as well as some of the internal shard mechanics of the transaction log, sync_id, and recovery process.

But I feel like I'm missing something regarding the "local store recovery" process. Maybe it's simpler than I imagine, but consider the following scenario:

  • index.unassigned.node_left.delayed_timeout is set to 5m
  • Node 1 has Shard A (primary) and Node 2 has Shard A' (replica). Both shards are part of the in-sync set
  • Node 2 crashes and leaves the cluster (e.g. hits an OOM error)
  • Shard A (primary) on Node 1 continues to receive indexing requests. As a result, Shard A' (replica) is now marked as inactive
  • The ES service on Node 2 restarts and re-joins the cluster within the 5m timeout (i.e. no shard re-allocation / shuffling took place)
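
For reference, the timeout in this scenario is a dynamic index setting that can be changed through the update index settings API. Below is a minimal sketch that only builds the request (it does not send it); the helper name `delayed_timeout_request` and the index name `my-index` are made up for illustration.

```python
import json

SETTING = "index.unassigned.node_left.delayed_timeout"

def delayed_timeout_request(index, timeout):
    """Build (method, path, body) for an update-index-settings call.

    Hypothetical helper: it only assembles the request, so it can be
    sent with whatever HTTP client you already use.
    """
    body = {"settings": {SETTING: timeout}}
    return "PUT", f"/{index}/_settings", json.dumps(body)

# "my-index" is a placeholder index name.
method, path, body = delayed_timeout_request("my-index", "5m")
print(method, path)
print(body)
```

Sending that body with `PUT /my-index/_settings` would apply the 5m delay to that index only.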

So what exactly happens when Node 2 restarts and rejoins the cluster? Based on the links referenced above, it's unclear to me what steps are taken but my best guess is:

  • When the ES service on Node 2 restarts, it will first replay any requests in the local transaction log that have not made it into the local Lucene data-store

  • When Node 2 rejoins the cluster, the master node will attempt to see if Shard A' (replica) on Node 2 is in-sync.

  • Since it is not in-sync, the master needs to sync down any changes from Shard A (primary) on Node 1 to Shard A' (replica) on Node 2

    • How does it do this?
      • Does the master use the transaction log for Shard A (primary) to re-send indexing requests to the out-of-sync replica (since the latest sync_id)?

      • Or is it more low-level, comparing Lucene files between the two shards?

  • Once Shard A' (replica) is caught up, it will re-join the active in-sync shard set

The reason I ask is that I'm trying to understand what impact a higher delayed_timeout has when a node leaves the cluster. I know it's generally recommended to raise the default of 1m to avoid shard shuffling (which can be quite expensive), but is there a risk in setting an unusually large timeout like 30m+? (Note - I understand that a higher delayed timeout will increase the time during which the cluster might run with a reduced number of replicas, which decreases overall cluster resiliency.)

  • At what point (if any) can a local out-of-sync shard no longer be caught up with the primary (so that the primary needs to be fully replicated)?

Anyway, I appreciate any info, and feel free to point me to any docs I might have missed that answer my question. Also, if the recovery process has changed drastically after 6.8 (we're currently upgrading to 7.x), can you call that out?

Bryan W.

You're asking about recovering a replica, which always has type PEER (not LOCAL_SHARDS), although it does do a certain amount of recovery of the local store as part of the process.
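
One way to see the recovery type for yourself is the index recovery API (`GET /<index>/_recovery`). The sketch below summarises an already-parsed response. The `sample` dict is made-up illustrative data, not real cluster output, and the field names (`type`, `stage`, `index.files.reused`, `index.files.recovered`, `translog.recovered`) are what I'd expect from the 7.x response, so check them against your version.

```python
def summarize_recovery(response):
    """Flatten a parsed index-recovery response into one row per shard copy."""
    rows = []
    for index_name, index_info in response.items():
        for shard in index_info.get("shards", []):
            files = shard.get("index", {}).get("files", {})
            rows.append({
                "index": index_name,
                "shard": shard.get("id"),
                "primary": shard.get("primary"),
                "type": shard.get("type"),            # e.g. PEER for a replica
                "stage": shard.get("stage"),
                "files_reused": files.get("reused"),  # already on local disk
                "files_recovered": files.get("recovered"),  # copied over
                "ops_replayed": shard.get("translog", {}).get("recovered"),
            })
    return rows

# Made-up sample shaped like a recovery response (assumption, not real output).
sample = {
    "my-index": {
        "shards": [{
            "id": 0, "type": "PEER", "stage": "DONE", "primary": False,
            "index": {"files": {"total": 10, "reused": 8, "recovered": 2}},
            "translog": {"recovered": 120, "total": 120},
        }]
    }
}

rows = summarize_recovery(sample)
for row in rows:
    print(row)
```

A replica that mostly reused its local files would show a high `files_reused` count next to a small `files_recovered` count.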

The master isn't involved; it's all worked out between the primary and the replica. It uses both file copying and replaying operations.
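
To make "both file copying and replaying operations" a bit more concrete, here is a toy model in Python. It is emphatically not the Elasticsearch implementation: the function, the checksum comparison, and the single `replica_checkpoint` number are simplifying assumptions, meant only to show the two mechanisms side by side.

```python
def toy_peer_recovery(primary_files, replica_files, primary_ops, replica_checkpoint):
    """Toy two-step recovery: copy missing/different files, then replay ops.

    primary_files / replica_files: dict of file name -> checksum (made up).
    primary_ops: list of {"seq_no": int, ...} operations held by the primary.
    replica_checkpoint: highest seq_no the replica is assumed to already have.
    """
    # Step 1 (file copy): reuse files that match, copy the rest.
    reused = sorted(n for n, c in primary_files.items() if replica_files.get(n) == c)
    copied = sorted(n for n, c in primary_files.items() if replica_files.get(n) != c)
    # Step 2 (operation replay): send operations the replica has not seen.
    replayed = [op for op in primary_ops if op["seq_no"] > replica_checkpoint]
    return {"reused": reused, "copied": copied, "replayed": replayed}

# Illustrative data only.
primary_files = {"_0.cfs": "aaa", "_1.cfs": "bbb", "segments_3": "ccc"}
replica_files = {"_0.cfs": "aaa", "segments_2": "zzz"}
primary_ops = [{"seq_no": n} for n in range(1, 6)]

plan = toy_peer_recovery(primary_files, replica_files, primary_ops, replica_checkpoint=3)
print(plan)
```

In this toy run the replica keeps one file, receives two, and replays the two operations it missed; the real process makes this trade-off with far more care.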

On the risk of a larger timeout: the reduced resiliency you mention is pretty much it. Also, the longer the replica is away, the more operations it'll have to replay when it comes back, which can be slower than just starting from scratch.

The overall structure is largely the same in 7.x, but 6.8 was released almost 3 years ago and there have been quite a few improvements and optimisations since then. Nothing that affects the above info though.

Oh sorry, I missed this (the point at which an out-of-sync shard can no longer be caught up with the primary):

It will always be possible to bring it back up to date somehow.

Thanks David for your response!

Just curious - is there any advanced documentation that describes some of the internal mechanics here? I'm looking to learn a bit more.


Not really. The implementation has all the details of course, and isn't totally impenetrable, but it's not exactly easy going either...
