We run a small 4-node cluster (2 data nodes) which was recently upgraded to 6.5.4 without any apparent problems. It only has about 100 indices with 5 shards each and 1 replica per index, so roughly 1000 shards in total.
Yesterday evening there was some Linux OS patching and all nodes were rebooted round-robin. About 2 hours after the reboot, once the cluster was green again, one data node crashed with a StackOverflowError. A few seconds before the crash that node had logged an error about "Too many open files". After it was restarted, the other data node suddenly showed the same behaviour and also crashed, and so it went on.
I had to increase the maximum number of open files to 200'000 (it was at 65'000 before), and only then did the cluster manage to get back in sync. I checked the data directory on the nodes with a quick find -type f | wc -l and found a whopping ~165'000 files, which could explain the "Too many open files" errors.
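For reference, this is roughly how I checked the counts and limits; the data path is the one from our setup, and the pgrep pattern assumes a single Elasticsearch process on the host:

    # count files under the Elasticsearch data directory (our path, adjust to yours)
    find /<path>/elastic/data -type f | wc -l

    # file-descriptor limit of the running Elasticsearch process
    grep 'Max open files' /proc/$(pgrep -f org.elasticsearch.bootstrap.Elasticsearch)/limits

    # file descriptors the process actually holds open right now
    ls /proc/$(pgrep -f org.elasticsearch.bootstrap.Elasticsearch)/fd | wc -l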
But I find it very surprising that there is such a high number of files for the relatively small number of shards. The whole cluster holds fewer than 100 million documents.
Does this high number of files make sense? Or did we configure something wrong when doing the 6.5.4 upgrade? What could affect this number of files?
It seems a little high, but not ridiculously so. Each shard is made of segments, and each segment can comprise many files, so the ~500 shards on each data node * 10 segments per shard * 20 files per segment would add up to 100,000 files on that node.
You can see the segments that make up each shard with the segments API.
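For example (the index name here is just a placeholder):

    # per-shard segment details for one index
    GET /my-index/_segments

    # or a flat, per-segment overview across all indices
    GET _cat/segments?v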
You probably have too many too-small shards. Here is an article with some recommendations about shard sizes:
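If you do end up consolidating, one way to cut the shard count without a full reindex is the shrink API. Roughly like this, where the index names, node name, and target shard count are only placeholders, and the source index has to be made read-only and fully allocated onto one node first:

    # step 1: relocate all shards of the source index onto one node and block writes
    PUT /my-source-index/_settings
    {
      "settings": {
        "index.routing.allocation.require._name": "my-shrink-node",
        "index.blocks.write": true
      }
    }

    # step 2: shrink into a new index with fewer primary shards
    # (the target shard count must be a factor of the source's, e.g. 5 -> 1)
    POST /my-source-index/_shrink/my-shrunk-index
    {
      "settings": {
        "index.number_of_shards": 1,
        "index.number_of_replicas": 1
      }
    }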
That sounds like an error that is worth investigating further. Can you share the corresponding stack trace?
I just had it happen again: both data nodes crashed, this time with ulimit set to 200'000. Now this is definitely weird, because there are only ~165'000 files in the whole data directory!
I'll start reducing the shard count and rebuilding the indices anyway. The stack trace was:
[2019-03-21T21:03:56,675][WARN ][o.e.i.e.Engine ] [lxleukxxx] [xxxx][0] failed to read latest segment infos on flush
java.nio.file.FileSystemException: /<path>/elastic/data/nodes/0/indices/lGLWyTDoQfKvm2hXQPtLuw/0/index: Too many open files
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:91) ~[?:?]
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) ~[?:?]
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107) ~[?:?]
at sun.nio.fs.UnixFileSystemProvider.newDirectoryStream(UnixFileSystemProvider.java:427) ~[?:?]
at java.nio.file.Files.newDirectoryStream(Files.java:457) ~[?:1.8.0_71]
at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:215) ~[lucene-core-7.5.0.jar:7.5.0 b5bf70b7e32d7ddd9742cc821d471c5fabd4e3df - jimczi - 2018-09-18 13:01:13]
at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:234) ~[lucene-core-7.5.0.jar:7.5.0 b5bf70b7e32d7ddd9742cc821d471c5fabd4e3df - jimczi - 2018-09-18 13:01:13]
at org.apache.lucene.store.FilterDirectory.listAll(FilterDirectory.java:58) ~[lucene-core-7.5.0.jar:7.5.0 b5bf70b7e32d7ddd9742cc821d471c5fabd4e3df - jimczi - 2018-09-18 13:01:13]
at org.apache.lucene.store.FilterDirectory.listAll(FilterDirectory.java:58) ~[lucene-core-7.5.0.jar:7.5.0 b5bf70b7e32d7ddd9742cc821d471c5fabd4e3df - jimczi - 2018-09-18 13:01:13]
at org.elasticsearch.index.store.Store.failIfCorrupted(Store.java:598) ~[elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.index.store.Store.failIfCorrupted(Store.java:594) ~[elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.index.store.Store.readLastCommittedSegmentsInfo(Store.java:184) ~[elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.index.engine.InternalEngine.refreshLastCommittedSegmentInfos(InternalEngine.java:1765) [elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.index.engine.InternalEngine.flush(InternalEngine.java:1736) [elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.index.shard.IndexShard.flush(IndexShard.java:1051) [elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.action.admin.indices.flush.TransportShardFlushAction.shardOperationOnPrimary(TransportShardFlushAction.java:54) [elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.action.admin.indices.flush.TransportShardFlushAction.shardOperationOnPrimary(TransportShardFlushAction.java:35) [elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryShardReference.perform(TransportReplicationAction.java:1022) [elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryShardReference.perform(TransportReplicationAction.java:1000) [elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.action.support.replication.ReplicationOperation.execute(ReplicationOperation.java:102) [elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.onResponse(TransportReplicationAction.java:356) [elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.onResponse(TransportReplicationAction.java:296) [elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.action.support.replication.TransportReplicationAction$1.onResponse(TransportReplicationAction.java:963) [elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.action.support.replication.TransportReplicationAction$1.onResponse(TransportReplicationAction.java:960) [elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:271) [elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:238) [elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.index.shard.IndexShard.acquirePrimaryOperationPermit(IndexShard.java:2327) [elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.action.support.replication.TransportReplicationAction.acquirePrimaryShardReference(TransportReplicationAction.java:972) [elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.action.support.replication.TransportReplicationAction.access$500(TransportReplicationAction.java:97) [elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.doRun(TransportReplicationAction.java:317) [elasticsearch-6.5.4.jar:6.5.4]
And again it crashed with a StackOverflowError; here's the stack trace for that:
[2019-03-21T21:03:56,775][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [lxleukxxx] fatal error in thread [elasticsearch[lxleukxxx][flush][T#20]], exiting
java.lang.StackOverflowError: null
at org.elasticsearch.common.io.stream.StreamOutput.writeBoolean(StreamOutput.java:428) ~[elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.common.io.stream.StreamOutput.writeException(StreamOutput.java:829) ~[elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.common.io.stream.StreamOutput.writeException(StreamOutput.java:950) ~[elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.ElasticsearchException.writeStackTraces(ElasticsearchException.java:743) ~[elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.common.io.stream.StreamOutput.writeException(StreamOutput.java:952) ~[elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.common.io.stream.StreamOutput.writeException(StreamOutput.java:950) ~[elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.ElasticsearchException.writeStackTraces(ElasticsearchException.java:743) ~[elasticsearch-6.5.4.jar:6.5.4]
and so on. I don't have the full trace, since I'm feeding this log into another ES instance and the message size is limited because of Kafka. I can log on to the machine and get the beginning of the stack if that helps.
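If it helps, I'd pull it out of the node's local log with something along these lines (the log path and file name are just how our setup is laid out, not a default I'm sure of):

    # grab the fatal error plus the lines following it from the node's own log file
    grep -A 500 'fatal error in thread' /<path>/elastic/logs/elasticsearch.log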
Hmm, this does seem wrong. Did the crash coincide with any client activity? Do you know whether the number of open files was growing steadily throughout the lifetime of the process, or did it suddenly spike?
Are there any other log messages that might indicate impending doom?
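If you want to keep an eye on the open-file question going forward, the nodes stats API reports each node's current and maximum file descriptors, e.g. (the filter_path is optional, it just trims the response):

    # current vs. maximum file descriptors for every node
    GET _nodes/stats/process?filter_path=nodes.*.name,nodes.*.process.open_file_descriptors,nodes.*.process.max_file_descriptors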
Yes please. It looks like it was trying to write out a very tall stack of exceptions, which suggests that perhaps two exception handlers kept throwing the same exception back and forth. Unfortunately I don't think we'll find out what the actual exception was, unless it was being logged each time it was re-thrown. Was it?