We're experiencing a critical production issue in Elasticsearch 6.2.2 related to open_file_descriptors. The cluster is as close a replica of a 5.2.2 cluster as possible, and documents are indexed into both clusters in parallel.
While the indexing performance of the new cluster seems to be at least as good as that of the 5.2.2 cluster, the new nodes' open_file_descriptors count is reaching far higher levels than anything we see on v5.2.2.
All machines have a ulimit of 65536 open files, as recommended by the official documentation.
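For reference, this is how we double-check the limit that is actually applied to the running process; the PID below is a placeholder, and the _nodes/stats fields are the standard process stats:

# <es-pid> is a placeholder for the Elasticsearch process ID
cat /proc/<es-pid>/limits | grep 'open files'
# per-node limit and current count as reported by Elasticsearch itself
GET /_nodes/stats/process?filter_path=**.max_file_descriptors,**.open_file_descriptors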
All nodes in the v5.2.2 cluster have up to 4,500 open_file_descriptors, while the new v6.2.2 nodes are split: some stay at up to 4,500 open_file_descriptors, while others consistently open more and more file descriptors until they reach the limit and crash with java.nio.file.FileSystemException: Too many open files:
[WARN ][o.e.c.a.s.ShardStateAction] [prod-elasticsearch-master-002] [newlogs_20180315-01][0] received shard failed for shard id [[newlogs_20180315-01][0]], allocation id [G8NGOPNHRNuqNKYKzfiPcg], primary term [0], message [shard failure, reason [already closed by tragic event on the translog]], failure [FileSystemException[/mnt/nodes/0/indices/nIgarkzwRwe0DmT-nmLhvg/0/translog/translog.ckp: Too many open files]]
java.nio.file.FileSystemException: /mnt/nodes/0/indices/nIgarkzwRwe0DmT-nmLhvg/0/translog/translog.ckp: Too many open files
After this exception, some of the nodes throw many more exceptions and then release file descriptors; other times they simply crash. The issue keeps repeating, alternating between nodes.
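To see which nodes are climbing, we sample the per-node file descriptor counts periodically, e.g. with the cat nodes API (fdc/fdm/fdp are the current, max, and percent file descriptor columns):

GET /_cat/nodes?v&h=name,fdc,fdm,fdp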
I'd be happy to provide additional details, whatever is needed.
@iamredlus It would be very helpful for diagnosing the issue if you could provide the shard-level _stats. You can get them via GET /_stats?include_segment_file_sizes&level=shards. Thank you.
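For example, from a shell (assuming the default localhost:9200 endpoint; the output file name is just an example):

# dump shard-level stats, including per-segment file sizes, to a file
curl -s 'http://localhost:9200/_stats?include_segment_file_sizes&level=shards' > shard_stats.json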
The root cause is that one replica in the user's cluster got into an infinite flushing loop. We helped the user resolve the issue by rebuilding the replica.
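For anyone hitting the same symptom, one way to rebuild a replica is to drop it and add it back so it is re-copied from the primary; a minimal sketch, using the index name from the log above as an example:

# temporarily drop the replica of the affected index
PUT /newlogs_20180315-01/_settings
{ "index": { "number_of_replicas": 0 } }

# re-add it so a fresh copy is recovered from the primary
PUT /newlogs_20180315-01/_settings
{ "index": { "number_of_replicas": 1 } }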
I'm facing a similar issue with Elasticsearch after upgrading to 6.2.2 from 5.6.4.
The number of open files grows to an unreasonable level and the cluster node crashes.
The culprit seems to be that a very large number of .tlog files is created for some indices:
This index has around 120k .tlog files and it's a primary shard.
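For reference, I count the .tlog files under the shard's translog directory roughly like this (the data path, index UUID, and shard number are placeholders for my setup):

# count translog generation files for one shard of the suspect index
find <data-path>/nodes/0/indices/<index-uuid>/0/translog -name '*.tlog' | wc -l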
Currently, the only way I've found to get rid of all the files is to use Cluster Reroute to move the shard to a different server.
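In other words, something like this (the index name, shard number, and node names are placeholders):

POST /_cluster/reroute
{
  "commands": [
    {
      "move": {
        "index": "my-index",
        "shard": 0,
        "from_node": "node-1",
        "to_node": "node-2"
      }
    }
  ]
}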
@LLin
Would you please share the shard-level stats of that index (/{index}/_stats?level=shards)? You can email them to me at firstname dot lastname at elastic.co. Thank you!
I have the same issues after upgrading from 5.6.4 to 6.2.2.
@nhat How can I stabilize my cluster based on the shard _stats until an official fix is ready?
Edit: If I look at /proc/<ES-PID>/fd, most of the files (out of over 100,000) are ...indices/AB564m6dTgOWBf7gEqvBiw/translog... -> now I know which index is the problem. What should I do with this index?
Edit 2: I removed the replica of this index and the file descriptor count dropped from 130,000 to 8,000.
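In case it helps others, this is roughly how I narrowed it down: count the open descriptors per index UUID and then resolve the UUID to an index name (the PID is a placeholder):

# group open file descriptors by index UUID, most affected first
ls -l /proc/<es-pid>/fd | grep -o 'indices/[^/]*' | sort | uniq -c | sort -rn | head
# map the UUID back to an index name
GET /_cat/indices?v&h=index,uuid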