Confusion about how the global checkpoint advances in ES 7.6

In ES 7.6, I see that we introduced a persisted local checkpoint and a processed local checkpoint. After each replicated update, replicas return their persisted local checkpoint to the primary shard, and the primary advances the global checkpoint based on these checkpoints from the in-sync replicas. However, since the persisted local checkpoint can lag behind the sequence number of the operations that have actually been processed (namely the processed local checkpoint), I wonder whether it is possible for the global checkpoint to catch up to MAX_SEQ.

Here is an example to illustrate this. Suppose we have 3 copies of one shard, and all copies initially have sequence number 6 for both the persisted and the processed local checkpoint. After replicating an update, we could have:

Primary:  persisted local checkpoint is 6, processed local checkpoint is 7
Replica1: persisted local checkpoint is 6, processed local checkpoint is 7
Replica2: persisted local checkpoint is 6, processed local checkpoint is 7

For the moment, the global checkpoint doesn't advance and lags behind MAX_SEQ, which is 7. I don't see when the global checkpoint would catch up to MAX_SEQ.
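The scenario above can be sketched in a few lines of Java. This is a minimal model, not Elasticsearch source: it assumes the global checkpoint is simply the minimum of the persisted local checkpoints reported by the in-sync copies, and the class and method names (GlobalCheckpointSketch, computeGlobalCheckpoint) are hypothetical.

```java
import java.util.Map;

// Hypothetical model, not Elasticsearch source: the primary's view of each in-sync copy.
final class GlobalCheckpointSketch {

    // Assumption: the global checkpoint is the lowest persisted local checkpoint
    // among the in-sync copies.
    static long computeGlobalCheckpoint(Map<String, Long> persistedLocalCheckpoints) {
        return persistedLocalCheckpoints.values().stream()
            .mapToLong(Long::longValue)
            .min()
            .orElseThrow(() -> new IllegalStateException("no in-sync copies"));
    }

    public static void main(String[] args) {
        // State from the example: op 7 is processed on every copy but not yet persisted anywhere.
        Map<String, Long> beforeFsync = Map.of("primary", 6L, "replica1", 6L, "replica2", 6L);
        System.out.println(computeGlobalCheckpoint(beforeFsync));

        // Once every copy has persisted op 7 and reported it, the minimum moves up.
        Map<String, Long> afterFsync = Map.of("primary", 7L, "replica1", 7L, "replica2", 7L);
        System.out.println(computeGlobalCheckpoint(afterFsync));
    }
}
```

In this model the global checkpoint stays at 6 until the slowest copy reports a persisted local checkpoint of 7, so the question reduces to when each copy persists and reports op 7.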

If there are more operations in flight then we know that a future operation will eventually advance the global checkpoint. If there are no more operations in flight then the GlobalCheckpointSyncAction brings everything up to date instead.

Hi @iamorchid,

I believe you are referring to this change, which went into 7.3.

The persisted local checkpoint should eventually catch up unless there is something seriously wrong (a bug or an inability to persist the translog).

Before the change mentioned above, the global checkpoint could advance based on information from replicas that was not durably stored. While the data were safe, we rely on all data below the global checkpoint being durably stored to ensure that replicas and primary have the same data (as well as for CCR).

Is this based on a theoretical exercise/question or have you run into an issue where the global checkpoint no longer advances?

@HenningAndersen, thanks for your reply. @DavidTurner mentioned GlobalCheckpointSyncAction, and I noticed that we have the following logic in IndexShard.java to trigger the action. However, if we are not using asyncDurability, the action would not be triggered, right (supposing the global checkpoint is behind MAX_SEQ)? Or are there other opportunities to trigger that action?

The question is based on my analysis of the code (I'm a new learner and hope to better understand how ES works).

Thanks

maybeSyncGlobalCheckpoint runs periodically and regardless of the durability setting it syncs the global checkpoint if stats.getMaxSeqNo() == stats.getGlobalCheckpoint(), which is a way of expressing "no operations in flight".

But based on my analysis above, we could have stats.getMaxSeqNo() > stats.getGlobalCheckpoint(). To make the global checkpoint catch up to MAX_SEQ, we have to trigger GlobalCheckpointSyncAction. However, since stats.getMaxSeqNo() > stats.getGlobalCheckpoint(), the action won't be triggered.

With request-level durability every operation is properly fsynced before responding to the primary. This means that the last operation to complete on each copy will advance that copy's (persisted) local checkpoint up to the maximum sequence number and then tell the primary about it. Note that these things don't necessarily happen in sequence-number order, so the last operation isn't necessarily the one with the highest sequence number, but that doesn't affect the argument. Once all copies have completed all operations this means that the global checkpoint (on the primary) should indeed equal the maximum sequence number.
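To illustrate why out-of-order completion doesn't affect the argument, here is a minimal sketch of a per-copy tracker. It is a simplified, hypothetical model (the class PersistedCheckpointSketch and its methods are illustrative names, not the Elasticsearch API): the checkpoint advances only across contiguous persisted sequence numbers, so whichever operation completes last closes the final gap.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical, simplified tracker; names are illustrative and not the Elasticsearch API.
final class PersistedCheckpointSketch {
    // Sequence numbers that are persisted but still have a gap below them.
    private final Set<Long> persistedAhead = new HashSet<>();
    // Every sequence number <= checkpoint has been persisted.
    private long checkpoint;

    PersistedCheckpointSketch(long initialCheckpoint) {
        this.checkpoint = initialCheckpoint;
    }

    // Called once an operation has been fsynced; completions may arrive in any order.
    void markPersisted(long seqNo) {
        if (seqNo <= checkpoint) {
            return; // already covered by the checkpoint
        }
        persistedAhead.add(seqNo);
        // Advance over every contiguous sequence number that is now persisted.
        while (persistedAhead.remove(checkpoint + 1)) {
            checkpoint++;
        }
    }

    long getCheckpoint() {
        return checkpoint;
    }

    public static void main(String[] args) {
        PersistedCheckpointSketch copy = new PersistedCheckpointSketch(6);
        copy.markPersisted(9); // highest seq no completes first; gap at 7 and 8
        System.out.println(copy.getCheckpoint());
        copy.markPersisted(7);
        System.out.println(copy.getCheckpoint());
        copy.markPersisted(8); // last to complete, yet not the highest seq no
        System.out.println(copy.getCheckpoint());
    }
}
```

In this model, the response sent after the last completed operation carries a checkpoint equal to the maximum sequence number, regardless of which sequence number that operation had.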

In some sense this is why we have to handle async durability differently in maybeSyncGlobalCheckpoint: with async durability it isn't true that "no operations in flight" implies stats.getMaxSeqNo() == stats.getGlobalCheckpoint() so we must be conservative and do some extra work to make sure.
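The decision described here can be modelled roughly as follows. This is a simplified sketch under stated assumptions, not the actual IndexShard code: the class and method names are hypothetical, and the "some copy has not yet learned the current global checkpoint" condition is an assumption about what makes a sync worthwhile once no operations are in flight.

```java
import java.util.Arrays;

// Simplified model of the decision described above; hypothetical names, not the IndexShard source.
final class SyncDecisionSketch {

    static boolean shouldSync(long maxSeqNo, long globalCheckpoint, boolean asyncDurability,
                              long[] globalCheckpointsKnownToCopies) {
        boolean noOpsInFlight = maxSeqNo == globalCheckpoint;
        if (noOpsInFlight == false && asyncDurability == false) {
            // Request durability with operations in flight: a later replication
            // response is expected to advance the global checkpoint.
            return false;
        }
        // Async durability: local checkpoints advance only on fsync, so keep
        // asking the copies while the global checkpoint lags the max seq no.
        boolean asyncLagging = asyncDurability && globalCheckpoint < maxSeqNo;
        // Assumption: in either mode, sync if some copy has not yet learned
        // the primary's current global checkpoint.
        boolean copyBehind = Arrays.stream(globalCheckpointsKnownToCopies)
            .anyMatch(known -> known < globalCheckpoint);
        return asyncLagging || copyBehind;
    }

    public static void main(String[] args) {
        // Request durability, op 7 still in flight: no explicit sync.
        System.out.println(shouldSync(7, 6, false, new long[] {6, 6, 6}));
        // Async durability, global checkpoint lagging: sync to fetch fresh local checkpoints.
        System.out.println(shouldSync(7, 6, true, new long[] {6, 6, 6}));
        // No ops in flight, but one copy only knows global checkpoint 6: sync.
        System.out.println(shouldSync(7, 7, false, new long[] {7, 6, 7}));
    }
}
```

The first case is the one raised in the question: with request-level durability and maxSeqNo > globalCheckpoint, the periodic check does nothing, because the outstanding operation's own response is what moves the global checkpoint forward.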

I see what you mean. Some requests don't need to wait for GlobalCheckpointSyncAction, because they can trigger a sync before the replica replies to the primary, right? I will check the details further.

Thanks again for all your detailed explanations @DavidTurner @HenningAndersen !

Yes, that's correct. Most of the time there's ongoing indexing activity, in which the primary sends replication requests to the replicas and they respond with a TransportReplicationAction#ReplicaResponse that includes the replica's new persisted local checkpoint and its last-synced global checkpoint. We only fall back on an explicit sync when the indexing activity finishes.
