We are seeing a situation where, when we restart ES, shard recovery takes a long time. The service is usually down for only about 90 seconds, but recovery can take an hour or more. During my investigation I have been looking at the [index]/_stats?level=shards API, and I see this:
I just want to understand what this is telling me so I can figure out if it is normal or not.
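In case it helps, this is roughly how I am pulling the per-shard translog numbers out of that response (a minimal Python sketch; the host URL and the index name "my-index" are placeholders for our actual cluster):

```python
# Sketch: dump per-shard translog stats from the _stats?level=shards response.
# "http://localhost:9200" and "my-index" are placeholders.
import requests

ES = "http://localhost:9200"
INDEX = "my-index"

resp = requests.get(f"{ES}/{INDEX}/_stats", params={"level": "shards"})
resp.raise_for_status()
stats = resp.json()

# With level=shards, stats are broken out per shard copy under indices.<name>.shards.
for shard_id, copies in stats["indices"][INDEX]["shards"].items():
    for copy in copies:
        tl = copy["translog"]
        print(
            f"shard {shard_id} primary={copy['routing']['primary']} "
            f"uncommitted_ops={tl['uncommitted_operations']} "
            f"uncommitted_bytes={tl['uncommitted_size_in_bytes']}"
        )
```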
We are using the defaults for the translog (v6.8.8), so I would think the translog is getting flushed with each successful operation. Looking at the stats above, it is telling me there is ~21 GB of uncommitted data. If it is getting flushed continuously, shouldn't this be much lower?
The index has 80 primary shards (with 2 replicas). Each shard is ~35-45 GB, for a total index size of 2.9 TB and 5.3 billion docs.
The function of the translog is to ensure durability: operations that have not yet been committed to disk can be replayed from it if there is a problem. A flush operation commits that data and clears the translog.
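For example (a sketch; the host and index name are placeholders), you can trigger a flush yourself and watch the uncommitted counters drop back toward zero:

```python
# Sketch: force a flush, then re-check the translog stats for the index.
# "http://localhost:9200" and "my-index" are placeholders.
import requests

ES = "http://localhost:9200"
INDEX = "my-index"

# Flush commits the in-memory data to Lucene segments on disk and clears the translog.
requests.post(f"{ES}/{INDEX}/_flush").raise_for_status()

# uncommitted_operations / uncommitted_size_in_bytes should now be near zero.
stats = requests.get(
    f"{ES}/{INDEX}/_stats/translog", params={"level": "shards"}
).json()
for shard_id, copies in stats["indices"][INDEX]["shards"].items():
    for copy in copies:
        tl = copy["translog"]
        print(shard_id, tl["uncommitted_operations"], tl["uncommitted_size_in_bytes"])
```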
I have a rough idea of what the translog is and why it is used. My question is about interpreting the response from the _stats API. Since we are not setting anything for index.translog.durability, I am assuming we are on the default of "request":
request (default): fsync and commit after every request. In the event of hardware failure, all acknowledged writes will already have been committed to disk.
Should I therefore expect uncommitted_operations to be fairly low (as each operation should get flushed on completion)? I am just not certain whether a constant ~20 GB of uncommitted_size_in_bytes is normal or a sign of an issue.
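To double-check that assumption, I pulled the effective translog settings for the index, including the defaults we never set explicitly (a sketch; the host and index name are placeholders, and filter_path just trims the response down to the translog keys):

```python
# Sketch: show the effective index.translog.* settings, including defaults.
# "http://localhost:9200" and "my-index" are placeholders.
import json
import requests

ES = "http://localhost:9200"
INDEX = "my-index"

resp = requests.get(
    f"{ES}/{INDEX}/_settings",
    params={
        "include_defaults": "true",
        "filter_path": "*.settings.index.translog,*.defaults.index.translog",
    },
)
print(json.dumps(resp.json(), indent=2))
```

That should confirm whether durability really is "request" and show which flush_threshold_size we are running with.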
Again, my goal is to track down why reinitialization of a shard takes an hour when the node was only offline for 90 seconds. If it was incorrectly replaying an inflated translog, that could be the reason.
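To see whether translog replay is actually where the time goes, I am also planning to watch the recovery API during the next restart, since it breaks each shard recovery down by phase (again a sketch; host and index name are placeholders):

```python
# Sketch: break shard recoveries down by phase to see if translog replay dominates.
# "http://localhost:9200" and "my-index" are placeholders.
import requests

ES = "http://localhost:9200"
INDEX = "my-index"

recovery = requests.get(f"{ES}/{INDEX}/_recovery").json()

for shard in recovery.get(INDEX, {}).get("shards", []):
    tl = shard["translog"]
    idx = shard["index"]
    print(
        f"shard {shard['id']} stage={shard['stage']} "
        f"file_copy_ms={idx['total_time_in_millis']} "
        f"translog_ops={tl['recovered']}/{tl['total']} "
        f"translog_ms={tl['total_time_in_millis']}"
    )
```

If the translog ops and time are large compared to the file-copy phase, that would point at translog replay as the culprit.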