A question about primary/replica re-sync implementation

Let's say we have 3 replicas for one index shard and they have the following local op seqs:
node#1(primary) : 1, 2, 3, 4, 5, 6, 7, 8 (local checkpoint: 8, max seqNo: 8, global checkpoint: 5)
node#2(replica#1): 1, 2, 3, 4, 5, 7, 8 (local checkpoint: 5, max seqNo: 8, global checkpoint: 5)
node#3(replica#2): 1, 2, 3, 4, 5, 6, 7 (local checkpoint: 7, max seqNo: 7, global checkpoint: 5)

Suppose node#1 crashes and node#3 (replica#2) is promoted to the new primary. A re-sync then takes place, during which node#3 sends ops 6 and 7 to node#2. As part of the re-sync, node#2 also trims from its translog all ops above node#3's max seqNo (so seq 8 is trimmed).
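To make the trimming step concrete, here is a toy model of it in plain Python. This is not Elasticsearch code; the function name `trim_above` and its shape are illustrative only, assuming the re-sync simply discards translog ops above the new primary's max seqNo:

```python
# Toy model of the re-sync translog trim: the new primary tells each
# replica to discard translog operations above its own max seqNo.
def trim_above(translog_seqnos, new_primary_max_seqno):
    """Keep only ops whose seqNo is <= the new primary's max seqNo."""
    return [op for op in translog_seqnos if op <= new_primary_max_seqno]

node2_translog = [1, 2, 3, 4, 5, 7, 8]   # replica#1 from the example
node3_max_seqno = 7                      # new primary (replica#2) max seqNo

print(trim_above(node2_translog, node3_max_seqno))  # [1, 2, 3, 4, 5, 7]
```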

My question: after the re-sync, node#3 has only the changes for seq 1 through 7 in its local Lucene index, but node#2 additionally has the change for seq 8, since translog trimming does not roll back changes already applied to Lucene. So there could be a data inconsistency between the Lucene copies of node#3 and node#2 after the re-sync (even though their translogs are consistent).

Have I missed anything here?

@iamorchid When a replica detects the new primary, it will roll back its Lucene index, then recover locally up to the global checkpoint. In your example, operation #8 won't exist in node#2's copy after the primary-replica resync.
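A toy sketch of this answer, applied to the numbers in the question (again plain Python, not Elasticsearch code; `resync_replica` is a made-up name, and the model collapses rollback-to-safe-commit plus local translog replay into a single "drop everything above the global checkpoint" step):

```python
# Toy sketch of what happens on node#2 when it detects the new primary:
# roll the Lucene copy back to the global checkpoint, then receive the
# remaining ops (6 and 7) from the new primary during re-sync.
def resync_replica(lucene_ops, global_ckpt, resync_ops):
    # 1. Rollback + local recovery: keep only ops up to the global
    #    checkpoint (a simplification of rollback-then-replay).
    lucene = [op for op in lucene_ops if op <= global_ckpt]
    # 2. Re-sync: apply ops shipped by the new primary.
    for op in resync_ops:
        if op not in lucene:
            lucene.append(op)
    return sorted(lucene)

node2_lucene = [1, 2, 3, 4, 5, 7, 8]  # node#2's Lucene before rollback
print(resync_replica(node2_lucene, 5, [6, 7]))
# -> [1, 2, 3, 4, 5, 6, 7]: matches node#3, and op 8 is gone
```

The key point the sketch illustrates is that the rollback removes op 8 from Lucene, not just from the translog, so the two copies converge.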


Thanks for your reply. Does this logic also exist in 6.4.2? I'm currently looking at the 6.4.2 implementation and haven't noticed it so far. I'm not sure whether it was added in a newer version.

Hi @iamorchid,

It was implemented in 6.5.0 (see https://github.com/elastic/elasticsearch/pull/33473).