Specifics around local store recovery and out-of-sync shards

Bryan_Warner · January 25, 2022, 9:14pm

Hello,

I'm trying to understand better how local shard data is recovered after a node comes back online (e.g. after a crash). I believe this recovery is called a "local store recovery" as stated in Index recovery API | Elasticsearch Guide [7.16] | Elastic

Just as a quick side-note, I've already read some great documentation around how ES tracks in-sync shard copies as well as some of the internal shard mechanics of the transaction log, sync_id, and recovery process.

But I feel like I'm missing something regarding the "local store recovery" process. Maybe it's simpler than I imagine, but consider the following scenario:

index.unassigned.node_left.delayed_timeout is set to 5m
Node 1 has Shard A (primary) and Node 2 has Shard A' (replica). Both shards are part of the in-sync set
Node 2 crashes and leaves the cluster (e.g. hits an OOM error)
Shard A (primary) on Node 1 continues to receive indexing requests. As a result, Shard A' (replica) is now marked as inactive
The ES service on Node 2 restarts and re-joins the cluster within the 5m timeout (i.e. no shard re-allocation / shuffling took place)

So what exactly happens when Node 2 restarts and rejoins the cluster? Based on the links referenced above, it's unclear to me what steps are taken but my best guess is:

When the ES service on Node 2 restarts, it will first replay any requests in the local transaction log that have not made it into the local Lucene data-store
When Node 2 rejoins the cluster, the master node will attempt to see if Shard A' (replica) on Node 2 is in-sync.
Since it is not in-sync, the master needs to sync down any changes from Shard A (primary) on Node 1 to Shard A' (replica) on Node 2
- How does it do this?
  - Does the master use the transaction log for Shard A (primary) to re-send indexing requests to the out-of-sync replica (since the latest sync_id)?
  - Or is more low-level and compare Lucene files between the two shards?
Once Shard A' (replica) is caught up, it will re-join the active in-sync shard set

The reason I ask is because I'm trying to understand what impact setting a higher delayed_timeout has when a node leaves the cluster. I know it's generally recommended to up the default of 1m to avoid shard shuffling (which can be quite expensive), but is there a risk of setting an unusually large timeout like 30m+? (Note - I understand that a higher delayed timeout will increase the time in which your cluster might run with a reduced number of replicas, which decreases overall cluster resiliency)

At what point (if there is any), can a local out-of-sync shard not be caught with the primary? (and the primary needs to be fully replicated)

Anyway, appreciate any info and feel free to point me to any docs I might have missed that answer my question. Also, if the recovery process has changed drastically after 6.8 (currently upgrading to 7.x), can you just call that out.

Thanks
Bryan W.

DavidTurner · January 25, 2022, 9:55pm

You're asking about recovering a replica which always has type PEER (not LOCAL_SHARDS) although it does do a certain amount of recovery of the local store as part of the process.

The master isn't involved, it's all worked out between the primary and the replica. It uses both file copying and replaying operations.

That's pretty much it. Also when the replica comes back it'll have more operations to replay which can be slower than just starting from scratch.

The overall structure is largely the same in 7.x but 6.8 was released almost 3 years ago and there's been quite a few improvements and optimisations since then. Nothing that affects the above info tho.

DavidTurner · January 25, 2022, 9:56pm

Oh sorry missed this:

It will always be possible to bring it back up to date somehow.

Bryan_Warner · January 26, 2022, 4:13pm

Thanks David for you response!

Just curious - Is there any advanced documentation that describes some of the internal mechanics here? Looking to learn a bit more

Thanks
Bryan

DavidTurner · January 26, 2022, 5:42pm

Not really. The implementation has all the details of course, and isn't totally impenetrable, but it's not exactly easy going either...

github.com

elastic/elasticsearch/blob/12ad399c488f0cc60e19b5e1b29c6d569cb4351a/server/src/main/java/org/elasticsearch/indices/recovery/RecoverySourceHandler.java

/*
 * Copyright Elasticsearch B.V. and/or licensed to Elasticsearch B.V. under one
 * or more contributor license agreements. Licensed under the Elastic License
 * 2.0 and the Server Side Public License, v 1; you may not use this file except
 * in compliance with, at your election, the Elastic License 2.0 or the Server
 * Side Public License, v 1.
 */

package org.elasticsearch.indices.recovery;

import org.apache.logging.log4j.Logger;
import org.apache.logging.log4j.message.ParameterizedMessage;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexCommit;
import org.apache.lucene.index.IndexFormatTooNewException;
import org.apache.lucene.index.IndexFormatTooOldException;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.RateLimiter;
import org.apache.lucene.util.ArrayUtil;

This file has been truncated. show original

system · February 23, 2022, 5:43pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
After I restart a node or a node disconnected with cluster, all data will copy from remote, why not recovery from local Elasticsearch	2	808	November 3, 2017
Shard reallocation after rolling restart Elasticsearch	3	934	June 30, 2017
Rolling Restart -- Local Replicas do not reuse local data (2.4.2) Elasticsearch	5	968	July 21, 2017
Question on cluster restart and recovery Elasticsearch	2	311	July 6, 2017
Restarting ES on a node Elasticsearch	2	385	April 17, 2018

Specifics around local store recovery and out-of-sync shards

Related topics