Some questions regarding index.translog.durability's interaction with replica shards

I will refer to the "consistency" settings as "write resiliency" settings. Note that in <= 2.4 these were referred to as "write consistency" and in >= 5.0 these are referred to as "wait for active shards".

The translog durability setting does not interact with the write resiliency settings. Note that the write resiliency settings are pre-flight checks: when you specify to wait for N active shards (defaults to 1 since 5.0), the primary will wait for N active shard copies to be available before proceeding with the write (or the request times out). For a write to be acknowledged, it must be successfully acknowledged by all shard copies that were active at the start of the request (after the pre-flight check has passed). The translog durability settings, on the other hand, are about when the translog is fsynced after a write has completed on an individual shard copy. That is, the two settings have no interplay.
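To make the separation concrete, here is a minimal sketch using Python's requests library against a hypothetical my-index on localhost (the _doc endpoint assumes a recent version): translog durability lives in the index settings, while wait for active shards is a per-request parameter that only gates whether the write starts.

```python
import requests

BASE = "http://localhost:9200"

# Translog durability: controls when the translog is fsynced on each
# individual shard copy after a write has been applied there.
requests.put(
    f"{BASE}/my-index/_settings",
    json={"index": {"translog": {"durability": "async", "sync_interval": "5s"}}},
)

# Write resiliency: a pre-flight check. The primary waits for 2 active shard
# copies to be available before attempting the write (or the request times out).
requests.put(
    f"{BASE}/my-index/_doc/1",
    params={"wait_for_active_shards": 2},
    json={"field": "value"},
)
```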

This is a difficult question to answer. Since so much has changed here in recent history, I will only cover the current state in >= 6.0.

In a situation where a write is successfully replicated to some but not all replicas, there is indeed a chance for the shard copies to become out of sync. One of the shard copies will be promoted to primary, and this shard will initiate a re-sync with the remaining replicas. The re-sync replays to the remaining shard copies the operations above a global "safety marker" (the global checkpoint) that the newly promoted primary has received, and fills any gaps in its history (operations it missed from the previous primary) with no-ops. By definition, these writes were never acknowledged (because they are above the global safety marker). This synchronizes the translogs of the shard copies. However, it still leaves the possibility that there is an operation in the Lucene index on a shard copy that is not in the Lucene index on the newly promoted primary. We are actively working to address this and it will be fixed soon. When the previous primary returns, it will recover from the newly promoted primary, which also brings it in sync with the other shard copies.
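To illustrate which operations the re-sync touches, here is a purely conceptual Python sketch (this is not Elasticsearch code; the sequence numbers and data structures are invented for illustration): everything above the global checkpoint is either replayed or, if the new primary never received it, filled in with a no-op.

```python
def resync(new_primary_ops, global_checkpoint, max_seq_no):
    """Operations the newly promoted primary sends to the remaining replicas."""
    by_seq_no = {op["seq_no"]: op for op in new_primary_ops}
    to_send = []
    for seq_no in range(global_checkpoint + 1, max_seq_no + 1):
        if seq_no in by_seq_no:
            # Held by the new primary but never acknowledged to the client,
            # since it sits above the global checkpoint: replay it.
            to_send.append(by_seq_no[seq_no])
        else:
            # A gap: the previous primary processed this operation but this
            # copy never received it. Fill the hole with a no-op.
            to_send.append({"seq_no": seq_no, "op": "noop"})
    return to_send

# Example: global checkpoint is 10; the new primary saw 11 and 13 but not 12.
ops = [{"seq_no": 11, "op": "index"}, {"seq_no": 13, "op": "index"}]
print(resync(ops, global_checkpoint=10, max_seq_no=13))
```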

Prior to 6.0, repairing this situation required a full recovery, a process that would not be triggered until a shard copy went offline and came back, or went offline and was reassigned elsewhere.

Are you assuming writes to a single shard here? The answer depends on how frequently the translog is fsynced in the async scenario (the sync interval). If the interval is longer than 1s, I would expect the individual writes to perform better (fewer fsyncs). If the interval is shorter than 1s, I would expect the single bulk write to perform better (fewer fsyncs). There is some overhead from the individual writes that makes this not quite precise, but that is roughly how I would think about it. Of course, I would measure to know for sure, since this is heavily dependent on I/O performance, etc. You would have to play with different intervals to find the true boundary.
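As a rough illustration of that boundary, here is a back-of-envelope Python sketch. It assumes the alternative being compared is one request-durability bulk per second, which is what puts the crossover at a 1s sync interval; the real answer depends on your I/O, so measure.

```python
def fsyncs_per_second_async(sync_interval_s):
    # With index.translog.durability=async, the translog is fsynced once per
    # sync interval on a shard copy, regardless of how many writes arrive.
    return 1.0 / sync_interval_s

def fsyncs_per_second_request(requests_per_second):
    # With the default request durability, the translog is fsynced for every
    # (bulk) request before it is acknowledged.
    return float(requests_per_second)

bulk = fsyncs_per_second_request(1)  # one request-durability bulk per second

for interval in (0.2, 1.0, 5.0):
    async_rate = fsyncs_per_second_async(interval)
    better = "async individual writes" if async_rate < bulk else "single bulk"
    print(f"sync_interval={interval}s: {async_rate:.1f} vs {bulk:.1f} fsyncs/s -> {better}")
```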
