I'm trying to better understand how local shard data is recovered after a node comes back online (e.g. after a crash). I believe this recovery is called a "local store recovery", as stated in Index recovery API | Elasticsearch Guide [7.16] | Elastic
Just as a quick side-note, I've already read some great documentation around how ES tracks in-sync shard copies as well as some of the internal shard mechanics of the transaction log, sync_id, and recovery process.
But I feel like I'm missing something regarding the "local store recovery" process. Maybe it's simpler than I imagine, but consider the following scenario:
- `index.unassigned.node_left.delayed_timeout` is set to 5m
- Node 1 has Shard A (primary) and Node 2 has Shard A' (replica). Both shards are part of the in-sync set
- Node 2 crashes and leaves the cluster (e.g. hits an OOM error)
- Shard A (primary) on Node 1 continues to receive indexing requests. As a result, Shard A' (replica) is now marked as inactive
- The ES service on Node 2 restarts and re-joins the cluster within the 5m timeout (i.e. no shard re-allocation / shuffling took place)
So what exactly happens when Node 2 restarts and rejoins the cluster? Based on the links referenced above, it's unclear to me exactly what steps are taken, but my best guess is:
1. When the ES service on Node 2 restarts, it will first replay any operations in the local transaction log that have not made it into the local Lucene store
2. When Node 2 rejoins the cluster, the master will check whether Shard A' (replica) on Node 2 is in-sync
3. Since it is not in-sync, the master needs to sync down any changes from Shard A (primary) on Node 1 to Shard A' (replica) on Node 2. How does it do this?
   - Does it use the transaction log of Shard A (primary) to re-send indexing operations to the out-of-sync replica (everything since the latest sync_id)?
   - Or is it more low-level, comparing Lucene segment files between the two shard copies?
4. Once Shard A' (replica) is caught up, it will re-join the active in-sync shard set
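As a side note, my plan is to sanity-check whatever the answer is against the Index recovery API once Node 2 rejoins. A minimal example of what I'd run (the index name `my-index` and `localhost:9200` are just placeholders for my setup):

```shell
# Inspect how each shard copy of the index was recovered.
# In the response, "type" should show EXISTING_STORE for a local
# store recovery and PEER for a copy streamed from the primary;
# "stage" shows progress (e.g. TRANSLOG, FINALIZE, DONE).
curl -s 'http://localhost:9200/my-index/_recovery?human&pretty'
```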
The reason I ask is that I'm trying to understand what impact setting a higher `delayed_timeout` has when a node leaves the cluster. I know it's generally recommended to raise the default of 1m to avoid shard shuffling (which can be quite expensive), but is there a risk in setting an unusually large timeout like 30m+? (Note: I understand that a higher delayed timeout increases the time your cluster might run with a reduced number of replicas, which decreases overall cluster resiliency)
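For reference, this is the setting I mean, applied here to all indices via the update index settings API (the 30m value is just the hypothetical from my question, not something I've committed to):

```shell
# Delay replica re-allocation for 30 minutes after a node leaves
curl -s -X PUT 'http://localhost:9200/_all/_settings' \
  -H 'Content-Type: application/json' \
  -d '{"settings": {"index.unassigned.node_left.delayed_timeout": "30m"}}'
```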
- At what point (if any) can a local out-of-sync shard copy no longer be caught up with the primary, so that it has to be fully re-replicated from the primary instead?
Anyway, I appreciate any info, and feel free to point me to any docs I might have missed that answer my question. Also, if the recovery process has changed drastically after 6.8 (we're currently upgrading to 7.x), please call that out.