One of our clusters is in a bad state where the nodes appear to have thousands of file descriptors open. I found that there are many open translog files, starting at translog-1 and now going up to translog-2358. About 8 new ones are created every hour. Our other, healthy clusters typically only have one of these files.
What might be the cause of this?
Note: no other types of file descriptors are in excess except translog.
The workload is pretty straightforward: just a few replicas spread across a few nodes, mostly reads, with plenty of CPU/memory. We update the index every 10 minutes. One thing that isn't ideal is that some data replicas are allocated on the master nodes; when we created the cluster, we didn't set the master nodes up as master-only.
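For reference, this is roughly how I've been checking the counts (the index name `my-index` is just a placeholder for ours):

```
# Open file descriptors per node, from the process stats
GET _nodes/stats/process?filter_path=nodes.*.name,nodes.*.process.open_file_descriptors

# Translog operations and size per index
GET my-index/_stats/translog
```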
Upon further inspection, it looks like these open file descriptors started appearing after we scaled down our Kubernetes cluster (those replicas I mentioned).
I tried to force flush the translog, and while that cleared those open FDs, I see them building up again. A normal flush didn't seem to do anything.
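In case it helps, the flush calls I tried were roughly along these lines (`my-index` is a placeholder):

```
# Normal flush (didn't seem to do anything)
POST my-index/_flush

# Forced flush, waiting for any in-flight flush to complete first
POST my-index/_flush?force=true&wait_if_ongoing=true
```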
Still no idea what might be causing this or how to even fix this.
I went ahead and force flushed. The file descriptors cleared, but the cluster still seems to be responding very slowly to some API calls such as _nodes/stats and _nodes/info, taking nearly a minute to respond. Other API calls such as search seem fine. I'm guessing the master nodes are still affected and that's the reason.
@Christian_Dahlqvist Any thoughts? I also looked into trying to cancel the tasks, but none of them are cancellable. These tasks are taking up to 30 seconds when they should take less than a second.
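Roughly what I ran to inspect the tasks; the wildcard action filter is just my attempt at narrowing it to the stats calls:

```
# List in-flight tasks with details, grouped by parent task
GET _tasks?detailed=true&group_by=parents

# Narrow to node-stats related actions (wildcard filter)
GET _tasks?detailed=true&actions=cluster:monitor/nodes/stats*

# Cancellation attempt for a specific task (replace node_id:task_id);
# these tasks reported cancellable: false
POST _tasks/node_id:task_id/_cancel
```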
Maybe this is related to #49970, which was closed in 7.7.0. I suggest you upgrade to pick up that fix.
Not sure what too many translog generations might have to do with GET _nodes/stats nor why you think the master nodes are "still affected" (master nodes aren't involved in GET _nodes/stats except sometimes to respond with their own stats just like every other node). If you can reproduce this on the latest version then we can revisit these questions.
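One way to narrow down which nodes are slow to respond: you can target the nodes stats request at subsets of nodes by role, something like the following (assuming the role filters match how your nodes are configured):

```
# Stats from data nodes only
GET _nodes/data:true/stats/process

# Stats excluding the master-eligible nodes
GET _nodes/master:false/stats/process
```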
@DavidTurner Thanks for sharing that. Looks similar. We will go to the latest release.
Regarding that other issue... I restarted a random master node today and saw lots of the ongoing active tasks (up to 30 seconds for _nodes/stats, _nodes/info, etc. on multiple nodes) instantly clear/complete. I'm guessing it could have been something weird that happened when scaling down the cluster too quickly with the Kubernetes HPA, or something about our pod setup that we didn't configure perfectly.
FYI: I think restarting that single node also helped clear the file descriptors. However, this time I don't see the file descriptors accumulating again as they did after the force flush. I think the restart helped us, which might be contrary to the behavior seen in the ticket. A bit of a puzzle imo.