Elasticsearch 6.2.2 nodes crash after reaching ulimit setting

Hi,

We're experiencing a critical production issue in Elasticsearch 6.2.2 related to open_file_descriptors. The cluster is an exact replica (as much as possible) of a 5.2.2 cluster, and documents are indexed into both clusters in parallel.
While the indexing performance of the new cluster seems to be at least as good as that of the 5.2.2 cluster, the new nodes' open_file_descriptors count is reaching record-breaking levels (especially when compared to v5.2.2).

All machines have a ulimit of 65536, as recommended by the official documentation.
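For reference, this is how we compare the limit Elasticsearch actually sees with the current usage on each node (a minimal sketch; localhost:9200 is a placeholder for one of our nodes):

# max_file_descriptors should report 65536; open_file_descriptors is the live count per node.
curl -s 'http://localhost:9200/_nodes/stats/process?filter_path=nodes.*.name,nodes.*.process.max_file_descriptors,nodes.*.process.open_file_descriptors&pretty'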
All nodes in the v5.2.2 cluster have up to 4,500 open_file_descriptors. The new v6.2.2 nodes are split: some also stay around 4,500 open_file_descriptors, while others keep opening more and more file descriptors until they reach the limit and crash with java.nio.file.FileSystemException: Too many open files -

[WARN ][o.e.c.a.s.ShardStateAction] [prod-elasticsearch-master-002] [newlogs_20180315-01][0] received shard failed for shard id [[newlogs_20180315-01][0]], allocation id [G8NGOPNHRNuqNKYKzfiPcg], primary term [0], message [shard failure, reason [already closed by tragic event on the translog]], failure [FileSystemException[/mnt/nodes/0/indices/nIgarkzwRwe0DmT-nmLhvg/0/translog/translog.ckp: Too many open files]]
java.nio.file.FileSystemException: /mnt/nodes/0/indices/nIgarkzwRwe0DmT-nmLhvg/0/translog/translog.ckp: Too many open files

After this exception, some of the nodes throw many further exceptions and the number of open file descriptors drops back down. Other times they just crash. The issue then repeats itself on other nodes.

I'd be happy to provide additional details, whatever is needed.

Thanks!


@iamredlus It would be very helpful for diagnosing the issue if you could provide the shard-level _stats. You can get them via GET /_stats?include_segment_file_sizes&level=shards. Thank you.
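For example, something like this will dump the stats to a file you can attach (adjust the host to one of your nodes):

# Shard-level stats with per-segment file sizes, saved to a file.
curl -s 'http://localhost:9200/_stats?include_segment_file_sizes&level=shards&pretty' > shard_stats.json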

Hey @nhat, any news regarding the files @iamredlus sent you?

@Ariel_Assaraf I haven't received anything yet.

Just forwarded you the email with the attachments.
Please confirm you received it 🙂
Thanks!

Not yet. My email is first name dot last name at elastic.co

For future readers:

The root cause is that one replica in the user's cluster got into an infinite flushing loop. We helped the user resolve the issue by rebuilding the replica.
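One common way to rebuild a replica (a sketch only; the index name is a placeholder, and the replica count should be restored to whatever the index originally had) is to drop number_of_replicas to 0 and then raise it again so a fresh copy is allocated:

# Remove the bad replica of the affected index (placeholder name)...
curl -s -X PUT 'http://localhost:9200/my-affected-index/_settings' -H 'Content-Type: application/json' -d '{"index":{"number_of_replicas":0}}'
# ...then add it back; the new replica is recovered from the primary.
curl -s -X PUT 'http://localhost:9200/my-affected-index/_settings' -H 'Content-Type: application/json' -d '{"index":{"number_of_replicas":1}}'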

Hello,

I'm facing a similar issue with Elasticsearch after upgrading to 6.2.2 from 5.6.4.
The number of open files grows to an unreasonable level and the cluster node crashes.

The culprit seems to be that a very large number of .tlog files are created for some indices:

for example:

java 53621 elasticsearch *767r REG 253,10 43 1086962675 /data/elasticsearch/nodes/0/indices/91x35hspTdSfPy84E4cUwQ/6/translog/translog-100105.tlog

This index has around 120k .tlog files, and it's a primary shard.
Currently, the only way I've found to get rid of all the files is to use Cluster Reroute to move the shard to a different server.
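For reference, the reroute call looks roughly like this (the index name and node names are placeholders; shard 6 matches the path in the example above):

# Move shard 6 of the affected index to another data node.
# "my-affected-index", "node-1" and "node-2" are placeholders.
curl -s -X POST 'http://localhost:9200/_cluster/reroute' -H 'Content-Type: application/json' -d '
{
  "commands": [
    { "move": { "index": "my-affected-index", "shard": 6, "from_node": "node-1", "to_node": "node-2" } }
  ]
}'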


@LLin
Would you please share the shard-level stats of that index (/{index}/_stats?level=shards)? You can email me at firstname dot lastname at elastic.co. Thank you!

This seems to be a deeper issue than just rebuilding the replicas -

A pull request with a fix is being discussed here:

I hope this will be merged soon and we can test this in our production as well.

Hi,

I have the same issues after upgrading from 5.6.4 to 6.2.2.

@nhat How can I stabilize my cluster based on the shard _stats until an official fix is ready?

Edit: If I look at /proc/ES-PID/fd, most of the files (over 100,000) are ...indices/AB564m6dTgOWBf7gEqvBiw/translog... -> Now I know the problematic index. What should I do with this index?
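In case it helps anyone else, this is roughly how I grouped the open descriptors by translog directory (ES-PID is the Elasticsearch process id):

# Count open fds that point into a translog directory, grouped by index UUID and shard.
ls -l /proc/ES-PID/fd | grep -o 'indices/[^/]*/[0-9]*/translog' | sort | uniq -c | sort -rn | head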

Edit 2: I removed the replica of this index and the file descriptor count dropped from 130,000 to 8,000 🙂

Thanks

Commit #29125 fixes this issue. It is not yet available in the latest Elasticsearch release, but it can be built from the branch:
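Rough outline for building it yourself (a sketch only; the branch name is a placeholder, and the repository's CONTRIBUTING.md has the exact Gradle version and build command for that release line):

# Clone the Elasticsearch source, switch to the branch containing the fix,
# and assemble the distribution archives with Gradle.
git clone https://github.com/elastic/elasticsearch.git
cd elasticsearch
git checkout <branch-with-the-fix>
gradle assemble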

