How to understand translog stats

We are seeing a situation where, when we restart ES, there is a long recovery time for the shards. The service is usually down for only about 90 seconds, but recovery can take an hour or so. During my investigation, I have been looking at the [index]/_stats?level=shards API, and I see this:

      "translog" : {
        "operations" : 1024765,
        "size_in_bytes" : 46171546377,
        "uncommitted_operations" : 502502,
        "uncommitted_size_in_bytes" : 21484734315,
        "earliest_last_modified_age" : 0
      },

I just want to understand what this is telling me so I can figure out if it is normal or not.
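
For reference, here is roughly how I am pulling the per-shard translog numbers above (a minimal sketch in Python with the requests library; the endpoint and index name are placeholders for our setup):

    import requests

    ES = "http://localhost:9200"  # placeholder cluster endpoint
    INDEX = "my_index"            # placeholder index name

    # level=shards adds a per-shard-copy breakdown under indices.<name>.shards,
    # and each copy carries its own "translog" section like the one above.
    resp = requests.get(f"{ES}/{INDEX}/_stats", params={"level": "shards"})
    resp.raise_for_status()
    shards = resp.json()["indices"][INDEX]["shards"]

    for shard_id, copies in sorted(shards.items(), key=lambda kv: int(kv[0])):
        for copy in copies:
            tl = copy["translog"]
            role = "primary" if copy["routing"]["primary"] else "replica"
            print(f"shard {shard_id} ({role}): "
                  f"ops={tl['operations']} "
                  f"uncommitted_ops={tl['uncommitted_operations']} "
                  f"uncommitted_bytes={tl['uncommitted_size_in_bytes']}")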

We are using the translog defaults (v6.8.8), so I would think the translog is getting flushed with each successful operation. If I look at the stats above, they tell me there is ~21 GB of uncommitted data. If the translog is being flushed continuously, shouldn't this number be much lower?
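
To double-check which translog settings are actually in effect rather than assuming, the settings can be read back with include_defaults (again just a sketch, with the same placeholder names):

    import requests

    ES = "http://localhost:9200"  # placeholder cluster endpoint
    INDEX = "my_index"            # placeholder index name

    # flat_settings gives dotted keys; include_defaults also returns settings
    # we never set explicitly, such as index.translog.durability and
    # index.translog.flush_threshold_size.
    resp = requests.get(
        f"{ES}/{INDEX}/_settings",
        params={"include_defaults": "true", "flat_settings": "true"},
    )
    resp.raise_for_status()
    body = resp.json()[INDEX]

    # Explicitly-set values live under "settings", unset ones under "defaults".
    merged = {**body.get("defaults", {}), **body.get("settings", {})}
    for key, value in sorted(merged.items()):
        if key.startswith("index.translog"):
            print(key, "=", value)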

The index has 80 primary shards (with 2 replicas). Each shard is ~35-45 GB, for a total index size of 2.9 TB and 5.3 billion docs.

Thanks!


The function of the translog is to ensure data safety: operations that have not yet been persisted to disk can be replayed when there is a problem, and a flush operation clears the translog.
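
If it helps, the effect can be checked directly by triggering a flush and re-reading the stats (a rough sketch; the endpoint and index name are placeholders):

    import requests

    ES = "http://localhost:9200"  # placeholder cluster endpoint
    INDEX = "my_index"            # placeholder index name

    # A flush performs a Lucene commit and trims the translog, so the
    # uncommitted counters should drop right after it.
    requests.post(f"{ES}/{INDEX}/_flush").raise_for_status()

    stats = requests.get(f"{ES}/{INDEX}/_stats/translog").json()
    translog = stats["indices"][INDEX]["primaries"]["translog"]
    print("uncommitted ops after flush:", translog["uncommitted_operations"])
    print("uncommitted bytes after flush:", translog["uncommitted_size_in_bytes"])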

I have a rough idea of what the translog is and why it is used. My question is about interpreting the response from the _stats API. Since we are not setting anything for index.translog.durability, I am assuming we are on the default, "request". From the docs:

> request
>
> (default) fsync and commit after every request. In the event of hardware failure, all acknowledged writes will already have been committed to disk.

Should I therefore expect uncommitted_operations to be fairly low (since each operation should be flushed on completion)? I am just not certain whether a constant ~20 GB of uncommitted_size_in_bytes is normal or a sign of an issue.
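
To see whether that ~20 GB is spread evenly or concentrated in a few shards, I am summarizing the primaries roughly like this (sketch, same placeholder names):

    import requests

    ES = "http://localhost:9200"  # placeholder cluster endpoint
    INDEX = "my_index"            # placeholder index name

    stats = requests.get(f"{ES}/{INDEX}/_stats/translog",
                         params={"level": "shards"}).json()
    shards = stats["indices"][INDEX]["shards"]

    sizes = []
    for shard_id, copies in shards.items():
        for copy in copies:
            if copy["routing"]["primary"]:  # primaries only
                sizes.append((copy["translog"]["uncommitted_size_in_bytes"], shard_id))

    sizes.sort(reverse=True)
    print("total uncommitted (primaries):", sum(s for s, _ in sizes), "bytes")
    for size, shard_id in sizes[:5]:
        print(f"shard {shard_id}: {size / 1024**3:.1f} GiB uncommitted")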

Again, my goal is to track down why reinitialization of a shard takes an hour when the node was only offline for 90 seconds. If it was incorrectly replaying an inflated translog, that could be the reason.
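
What I plan to check next is whether translog replay dominates the recovery itself; the recovery API reports per-shard translog progress while a recovery is running (sketch with placeholder names, and the exact field set may differ slightly in 6.8):

    import requests

    ES = "http://localhost:9200"  # placeholder cluster endpoint
    INDEX = "my_index"            # placeholder index name

    # active_only=true limits the response to recoveries still in progress.
    recovery = requests.get(f"{ES}/{INDEX}/_recovery",
                            params={"active_only": "true"}).json()

    for shard in recovery.get(INDEX, {}).get("shards", []):
        tl = shard.get("translog", {})
        print(f"shard {shard['id']} stage={shard['stage']}: "
              f"translog recovered {tl.get('recovered')} of {tl.get('total')} "
              f"({tl.get('percent')}), time_ms={tl.get('total_time_in_millis')}")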

