Respect bootstrapNewHistoryUUID option when restoring from snapshot

Shall we add a bootstrapNewHistoryUUID option to SnapshotRecoverySource which, when set to false, disables bootstrapping a new history UUID for the primary shard's index?
In some environments, the primary shard may be restored from an external snapshot (that is, a snapshot not created by the current Elasticsearch cluster), and we want to keep the original history instead of bootstrapping a new historyUUID and translogUUID.

For example, in our environment we build offline Elasticsearch index snapshots using Spark (that is, from a DB data source to an Elasticsearch index snapshot) and have developed an Elasticsearch plugin to help restore such an index. For the primary shard we use the built-in snapshot restore, but for replica shards we restore the index from the snapshot in IndexEventListener.beforeIndexShardRecovery. We want the peer recovery of the replica shard to use sequence-number-based recovery and skip phase 1, which requires an identical historyUUID on both the primary and the replica shard.
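For reference, the built-in primary restore mentioned above is the standard snapshot restore API; a minimal request looks roughly like this (the repository, snapshot, and index names are placeholders):

```
POST /_snapshot/my_repo/my_snap/_restore
{
  "indices": "my_index",
  "include_global_state": false
}
```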

Below is the commit for this topic:

It needs more than that, it needs identical history in both shards (not just a fake matching history UUID). Restoring a snapshot loses that the history is identical, and also loses that the right operations are available for a sequence-number-based recovery anyway. In short, phase 1 is required after a snapshot restore.

Perhaps it would work better for you to mount your snapshot as a searchable snapshot.
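Mounting a searchable snapshot is a single API call along these lines (names are placeholders), after which the mounted index recovers its data directly from the repository:

```
POST /_snapshot/my_repo/my_snap/_mount?wait_for_completion=true
{
  "index": "my_index",
  "renamed_index": "my_index_mounted"
}
```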

I think we can achieve this goal (skipping phase 1 in peer recovery) by performing the following steps:

  1. We implement a plugin to restore the snapshot on both primary and replica shards, beginning with a standard primary snapshot recovery;
  2. Then, when we recover the replica shard, we pre-restore the snapshot in IndexEventListener.beforeIndexShardRecovery, which is possible by implementing our own custom IndexEventListener;
  3. After step 2, we also add a retention lease for the current replica shard on the primary, which is required for sequence-number-based recovery;
  4. Finally, if the primary shard's history UUID is identical to the one in the snapshot, the replica can perform sequence-number-based recovery and skip phase 1.
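If it helps, the precondition in step 4 can be checked from the outside: the history UUID of each shard copy appears in the Lucene commit user data, which is exposed by the shard-level stats API (the index name is a placeholder):

```
GET /my_index/_stats?level=shards&filter_path=indices.*.shards.*.commit.user_data.history_uuid
```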

For this scenario, shall we add a bootstrapNewHistoryUUID option to SnapshotRecoverySource which, when set to false, disables bootstrapping a new history UUID for the primary shard's index?

Yes, this is pretty much how mounting a searchable snapshot works, which is why I think you should be using it. Note in particular that IndexEventListener#beforeIndexShardRecovery was added precisely for use by searchable snapshots.

I agree with that. Searchable snapshots meet our needs in some scenarios, such as mounting a read-only index. However, we still want to index/delete documents after the shards are recovered, which is not allowed because the searchable-snapshot directory is immutable.

OK, if you want to index/delete documents after the restore then you definitely need a new history UUID, since you are creating a new (forked) history.

It's possible you could make Elasticsearch more intelligent about re-using segments from a different history in a recovery. IIRC these days each segment has its own UUID which (together with filenames/sizes/checksums) ought to be enough to identify the segment even if the history UUID is different.

I checked, and it looks like we already do this? I created a small index, took a snapshot, then restored the snapshot, and got these results:

GET /_cat/recovery?v
# index shard time  type     stage source_host source_node target_host target_node repository   snapshot files files_recovered files_percent files_total bytes bytes_recovered bytes_percent bytes_total translog_ops translog_ops_recovered translog_ops_percent
# i     0     510ms snapshot done  n/a         n/a         127.0.0.1   node-1      default-repo snap-1   1     1               100.0%        4           313   313             100.0%        3080        0            0                      100.0%
# i     0     893ms peer     done  127.0.0.1   node-1      127.0.0.1   node-0      n/a          n/a      1     1               100.0%        4           313   313             100.0%        3080        0            0                      100.0%

Note the files_recovered column only says 1 even though these shard copies contain more than one file -- everything else already exists on disk and is re-used. The file in question is the segments_N file, which makes sense: this is the only one that changes.
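As a side note, the per-file reuse can be inspected directly: passing detailed=true to the index recovery API lists every file in the recovery along with whether it was reused from disk or re-transferred (using the same index i as in the experiment above):

```
GET /i/_recovery?detailed=true&human
```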

That's right, phase 1 can be accelerated by reusing identical segment files even when peer recovery does not perform sequence-number-based recovery.
But consider the scenario where the replica shard performs peer recovery after the primary shard has already started and accepted index operations; this may result in segment merges and additional files being transferred in phase 1.

The replica should start its recovery very soon after the primary shard starts, and the segments to transfer are captured very early in the recovery process, so there are normally very few changes to the segments transferred in phase 1.

Or, to put it a different way, if there are a lot of indexing operations to catch up then it's much more efficient to do so with a file-based recovery. Recovering individual operations is an expensive process, and only really suitable for recovering a replica after a very short outage.

Agreed that transferring the diff of segment files is better than replaying expensive indexing operations; we'll refactor our plugin's snapshot-restore path accordingly.
Thanks David for your kind help!