We're experiencing a critical production issue in Elasticsearch 6.2.2 related to open_file_descriptors. The cluster is as close a replica of a 5.2.2 cluster as possible, and documents are indexed into both clusters in parallel.
While the indexing performance of the new cluster seems to be at least as good as that of the 5.2.2 cluster, the new nodes' open_file_descriptors count is reaching far higher levels than anything we see on v5.2.2.
All machines have a ulimit of 65536 open files, as recommended by the official documentation.
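For reference, this is how we double-check the limit that is actually applied to the running process; the PID below is a placeholder, and the _nodes/stats fields are the standard process stats:

# <es-pid> is a placeholder for the Elasticsearch process ID
cat /proc/<es-pid>/limits | grep 'open files'
# per-node limit and current count as reported by Elasticsearch itself
GET /_nodes/stats/process?filter_path=**.max_file_descriptors,**.open_file_descriptors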
All nodes in the v5.2.2 cluster have up to 4,500 open_file_descriptors, while the new v6.2.2 nodes are split: some stay at up to 4,500 open_file_descriptors, while others consistently open more and more file descriptors until they reach the limit and crash with java.nio.file.FileSystemException: Too many open files:
[WARN ][o.e.c.a.s.ShardStateAction] [prod-elasticsearch-master-002] [newlogs_20180315-01][0] received shard failed for shard id [[newlogs_20180315-01][0]], allocation id [G8NGOPNHRNuqNKYKzfiPcg], primary term [0], message [shard failure, reason [already closed by tragic event on the translog]], failure [FileSystemException[/mnt/nodes/0/indices/nIgarkzwRwe0DmT-nmLhvg/0/translog/translog.ckp: Too many open files]]
java.nio.file.FileSystemException: /mnt/nodes/0/indices/nIgarkzwRwe0DmT-nmLhvg/0/translog/translog.ckp: Too many open files
After this exception, some of the nodes throw many more exceptions and then release file descriptors; other times they simply crash. The issue keeps repeating, alternating between nodes.
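To see which nodes are climbing, we sample the per-node file descriptor counts periodically, e.g. with the cat nodes API (fdc/fdm/fdp are the current, max, and percent file descriptor columns):

GET /_cat/nodes?v&h=name,fdc,fdm,fdp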
I'd be happy to provide additional details, whatever is needed.
@iamredlus It would be very helpful for diagnosing the issue if you could provide the shard-level _stats. You can get them via GET /_stats?include_segment_file_sizes&level=shards. Thank you.
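For example, from a shell (assuming the default localhost:9200 endpoint; the output file name is just an example):

# dump shard-level stats, including per-segment file sizes, to a file
curl -s 'http://localhost:9200/_stats?include_segment_file_sizes&level=shards' > shard_stats.json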
The root cause is that one replica in the user's cluster got into an infinite flushing loop. We helped the user resolve the issue by rebuilding the replica.
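For anyone hitting the same symptom, one way to rebuild a replica is to drop it and add it back so it is re-copied from the primary; a minimal sketch, using the index name from the log above as an example:

# temporarily drop the replica of the affected index
PUT /newlogs_20180315-01/_settings
{ "index": { "number_of_replicas": 0 } }

# re-add it so a fresh copy is recovered from the primary
PUT /newlogs_20180315-01/_settings
{ "index": { "number_of_replicas": 1 } }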
I'm facing a similar issue with Elasticsearch after upgrading to 6.2.2 from 5.6.4.
The number of open files grows to an unreasonable level and the cluster node crashes.
The culprit seems to be that a very large number of .tlog files is created for some indices:
This index has around 120k .tlog files and it's a primary shard.
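For reference, I count the .tlog files under the shard's translog directory roughly like this (the data path, index UUID, and shard number are placeholders for my setup):

# count translog generation files for one shard of the suspect index
find <data-path>/nodes/0/indices/<index-uuid>/0/translog -name '*.tlog' | wc -l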
Currently, the only way I've found to get rid of all the files is to use Cluster Reroute to move the shard to a different server.
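In other words, something like this (the index name, shard number, and node names are placeholders):

POST /_cluster/reroute
{
  "commands": [
    {
      "move": {
        "index": "my-index",
        "shard": 0,
        "from_node": "node-1",
        "to_node": "node-2"
      }
    }
  ]
}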
@LLin
Would you please share the shard-level stats of that index (/{index}/_stats?level=shards)? You can email them to me at firstname dot lastname at elastic.co. Thank you!
I have the same issues after upgrading from 5.6.4 to 6.2.2.
@nhat How can I stabilize my cluster based on the shard _stats until an official fix is ready?
Edit: If I look at /proc/<ES-PID>/fd, most of the files (out of over 100,000) are ...indices/AB564m6dTgOWBf7gEqvBiw/translog... -> now I know which index is the problem. What should I do with this index?
Edit 2: I removed the replica of this index and the file descriptor count dropped from 130,000 to 8,000.
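In case it helps others, this is roughly how I narrowed it down: count the open descriptors per index UUID and then resolve the UUID to an index name (the PID is a placeholder):

# group open file descriptors by index UUID, most affected first
ls -l /proc/<es-pid>/fd | grep -o 'indices/[^/]*' | sort | uniq -c | sort -rn | head
# map the UUID back to an index name
GET /_cat/indices?v&h=index,uuid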