One of our clusters is in a bad state where the nodes appear to have thousands of file descriptors open. I found that there are many open translog files, starting at translog-1 and now going up to translog-2358. About 8 new ones are created every hour. Our other, healthy clusters typically only have one of these files.
What might be the cause of this?
Note: no other types of file descriptors are in excess except translog.
The workload is pretty straightforward: just a few replicas spread across a few nodes, mostly reads, with plenty of CPU/memory. We update the index every 10 minutes. One thing that isn't ideal is that some data replicas are allocated on the master nodes; when we created the cluster, we didn't set the master nodes up as master-only.
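For reference, this is roughly how I've been checking the counts (the index name `my-index` is just a placeholder for ours):

```
# Open file descriptors per node, from the process stats
GET _nodes/stats/process?filter_path=nodes.*.name,nodes.*.process.open_file_descriptors

# Translog operations and size per index
GET my-index/_stats/translog
```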
Upon further inspection, it looks like these open file descriptors started appearing after we scaled down our Kubernetes cluster (those replicas I mentioned).
I tried to force flush the translog, and while that cleared those open FDs, I see them building up again. A normal flush didn't seem to do anything.
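In case it helps, the flush calls I tried were roughly along these lines (`my-index` is a placeholder):

```
# Normal flush (didn't seem to do anything)
POST my-index/_flush

# Forced flush, waiting for any in-flight flush to complete first
POST my-index/_flush?force=true&wait_if_ongoing=true
```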
Still no idea what might be causing this or how to even fix this.
I went ahead and force flushed. The file descriptors cleared, but the cluster still seems to be responding very slowly to some API calls such as _nodes/stats and _nodes/info, taking nearly a minute to respond. Other API calls such as search seem fine. I'm guessing the master nodes are still affected and that's the reason.
@Christian_Dahlqvist Any thoughts? I also looked into trying to cancel the tasks, but none of them are cancellable. These tasks are taking up to 30 seconds when they should take less than a second.
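Roughly what I ran to inspect the tasks; the wildcard action filter is just my attempt at narrowing it to the stats calls:

```
# List in-flight tasks with details, grouped by parent task
GET _tasks?detailed=true&group_by=parents

# Narrow to node-stats related actions (wildcard filter)
GET _tasks?detailed=true&actions=cluster:monitor/nodes/stats*

# Cancellation attempt for a specific task (replace node_id:task_id);
# these tasks reported cancellable: false
POST _tasks/node_id:task_id/_cancel
```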
Maybe this is related to #49970, which was closed in 7.7.0. I suggest you upgrade to pick up that fix.
Not sure what too many translog generations might have to do with GET _nodes/stats nor why you think the master nodes are "still affected" (master nodes aren't involved in GET _nodes/stats except sometimes to respond with their own stats just like every other node). If you can reproduce this on the latest version then we can revisit these questions.
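One way to narrow down which nodes are slow to respond: you can target the nodes stats request at subsets of nodes by role, something like the following (assuming the role filters match how your nodes are configured):

```
# Stats from data nodes only
GET _nodes/data:true/stats/process

# Stats excluding the master-eligible nodes
GET _nodes/master:false/stats/process
```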
@DavidTurner Thanks for sharing that. Looks similar. We will go to the latest release.
Regarding that other issue... I restarted a random master node today and saw lots of the ongoing active tasks (up to 30 seconds for _nodes/stats, _nodes/info, etc. on multiple nodes) instantly clear/complete. I'm guessing it could have been something weird that happened when scaling down the cluster too quickly with the Kubernetes HPA, or something about our pod setup that we didn't configure perfectly.
FYI: I think restarting that single node also helped clear the file descriptors. However, this time I don't see the file descriptors accumulating again as they did after the force flush. I think the restart helped us, which might be contrary to the behavior seen in the ticket. A bit of a puzzle imo.