Some questions about index.translog.durability's interaction with replica shards

From the documentation, I see:

"async" - fsync and commit in the background every sync_interval. In the event of hardware failure, all acknowledged writes since the last automatic commit will be discarded.

So a few questions:

  1. How does this interact with consistency levels? Is quorum reached even though the items have not been flushed to disk? Does quorum just mean "a replica received it but has not necessarily flushed it"?

  2. What happens when you have several replicas, some of them see the change, but the primary shard crashes or is otherwise brought down unsafely? Do we still have data loss, or is the write recovered from one of the replica shards, so that when the old primary comes back it catches up with the newly elected primary and recovers?

  3. Bonus optional question - let's say you have 1,000 writes per second, and you do them one individual index operation at a time with durability set to async. On the other hand, let's say you do 1,000 writes per second in bulk writes with durability set to request. Ignoring consistency concerns, is there a significant performance benefit one way or the other, or would the two perform about the same?

Thanks!! We have a cluster with a high throughput of individual writes, and performance improves significantly when we switch to async (or to bulk ingestion, which requires additional work).

The answers here will help us decide how to move forward. Thanks!

Bump - not sure about bump etiquette here so let me know if it's not cool.

One more attempt.

I will refer to the "consistency" settings as "write resiliency" settings. Note that in <= 2.4 these were referred to as "write consistency" and in >= 5.0 these are referred to as "wait for active shards".

The translog durability setting does not interact with the write resiliency settings. The write resiliency settings are pre-flight checks: when you ask to wait for N active shards (defaults to 1 since 5.0), the primary waits for N active shard copies to be available before proceeding with a write (or times out). For a write to be acknowledged, it must be successfully acknowledged by all shard copies that were active at the start of the request (after the pre-flight check has passed). The translog durability setting is about when the translog is fsynced after a write has completed on an individual shard copy. That is, the two settings have no interplay.
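To make the separation concrete, here is a minimal sketch of where each setting lives (Python client; the index name, document, and shard count are made up, and parameter names vary across client versions):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

# Pre-flight check: the primary waits for at least 2 active shard copies
# before attempting the write. It says nothing about fsync.
es.index(
    index="my-index",
    id="1",
    body={"field": "value"},
    wait_for_active_shards=2,
)

# Translog durability is a separate per-index setting; it only controls when
# the translog is fsynced on each shard copy after the write has applied.
es.indices.put_settings(
    index="my-index",
    body={"index.translog.durability": "request"},
)
```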

This is a difficult question to answer. Since so much has changed here in recent history, I will only cover the current state in >= 6.0.

In a situation where a write is successfully replicated to some but not all replicas, there is indeed a chance for the shard copies to become out of sync. One of the shard copies will be promoted to primary, and this shard will initiate a re-sync with the remaining replicas. The re-sync replays, to the remaining shard copies, the operations above a global "safety marker" that the newly promoted primary had received, and fills in any gaps in its history (operations it missed from the previous primary) with no-ops. By definition, these writes were never acknowledged (because they are above the global safety marker). This synchronizes the translogs of the shard copies. However, it still leaves the possibility that there is an operation in the Lucene index on a shard copy that is not in the Lucene index on the newly promoted primary. We are actively working to address this and it will be fixed soon. When the previous primary returns, it will recover from the newly promoted primary, which also synchronizes it with the other shard copies.

Prior to 6.0, repairing this situation required a full recovery, a process that would not be triggered until a shard copy went offline and came back, or went offline and was reassigned elsewhere.

Are you assuming writes to a single shard here? The answer depends on how frequently the translog is being fsynced in the async scenario. If the sync_interval is longer than 1s, I would expect the individual writes to perform better (fewer fsyncs). If the sync_interval is shorter than 1s, I would expect the single bulk write to perform better (fewer fsyncs). There is some overhead from the individual writes that makes this not quite precise, but that's roughly how I would think about it. Of course, I would measure to know for sure, since this is heavily dependent on I/O performance, etc. You would have to play with different intervals to find the true boundary.
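If it helps to see the two shapes of the workload side by side, here is a rough sketch (Python client; the index name and documents are made up, and it assumes the index durability setting has already been switched appropriately in each scenario). The variable that matters is how many fsyncs per second each setup implies:

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()
docs = [{"value": i} for i in range(1000)]

# Scenario A: 1,000 individual index requests with async durability.
# The translog is fsynced in the background every sync_interval, regardless
# of how many requests arrive, so roughly one fsync per sync_interval.
for i, doc in enumerate(docs):
    es.index(index="my-index", id=str(i), body=doc)

# Scenario B: the same 1,000 documents in one bulk request with request
# durability. The translog is fsynced before the bulk request is
# acknowledged, i.e. roughly one fsync per bulk request.
helpers.bulk(
    es,
    (
        {"_index": "my-index", "_id": str(i), "_source": doc}
        for i, doc in enumerate(docs)
    ),
)
```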


Awesome set of answers that are incredibly helpful and detailed, thank you!

I'll review these with our team. We are currently on 2.4, but are hoping to make some migration hops to 6.x soon. I'll see if I can find documentation on the full recovery/offline process you mentioned. It's not the end of the world if we lose some data in this cluster, but it's best if it eventually recovers, and I'm wondering when that recovery process might occur in our cluster as it is currently configured.

The fact that you're working on improving this particular error path in 6.x (and it is already better handled) is really good to know.

Also, for the bonus question, that's what we were guessing (that the fsync count was the dominant factor there), but it's good to have it more or less confirmed. We'll do some measuring as well, as you suggested.

Thank you again for your time and assistance!
