We hit a problem with ES 5.5.0 today, about 12 hours after upgrading from 5.2.3. The upgrade went fine and the cluster performed well for many hours before the elected master crashed. Five hours later we restarted the failed node, and this immediately triggered an OOM crash on every node in the cluster.
01:27 -- node1 (the elected master) left the cluster. "failed to perform indices:data/write/bulk[s] on replica [logstash-2017.07.09][2] ... NodeDisconnectedException"
01:27 -- cannot connect to node1. New master elected: node3.
05:21 -- restarted node1. Master node3 left the cluster.
05:21 -- all nodes report "Force-closing a channel..." errors and OOM.
I'm gathering logs now and will include them in a follow-up post.
These are the logs from when node1 crashed, before it was restarted.
Restarting node1 caused the rest of the cluster to crash immediately. All nodes were then restarted, and node1 crashed again. The rest of the cluster is still running at the moment, with node1 stopped.
[2017-07-09T01:27:24,748][ERROR][o.e.i.e.Engine ] [node1] [logstash-2017.07.09][3] tragic event in index writer
org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:749) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:763) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
(...)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
Caused by: java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method) ~[?:1.8.0_131]
at java.lang.Thread.start(Thread.java:717) ~[?:1.8.0_131]
at org.apache.lucene.index.ConcurrentMergeScheduler.merge(ConcurrentMergeScheduler.java:553) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.apache.lucene.index.IndexWriter.maybeMerge(IndexWriter.java:2083) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
(...)
at org.elasticsearch.index.IndexService$BaseAsyncTask.run(IndexService.java:743) ~[elasticsearch-5.5.0.jar:5.5.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) ~[elasticsearch-5.5.0.jar:5.5.0]
... 3 more
[2017-07-09T01:27:24,732][ERROR][o.e.i.e.Engine ] [node1] [logstash-2017.07.09][3] tragic event in index writer
org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:749) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:763) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
(...)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
Caused by: java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method) ~[?:1.8.0_131]
(...)
----- repeat errors -----
[2017-07-09T01:27:24,767][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [node1] fatal error in thread [elasticsearch[node1][bulk][T#297]], exiting
java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method) ~[?:1.8.0_131]
(...)
While you probably don't mean to be malicious, simply linking to a random file on the internet can come across as suspicious. It'd be great if you could pastebin the logs too.
That aside, are you using a custom path.data value?
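Also, since the fatal error is "unable to create new native thread", it's worth checking whether the node is hitting the OS per-user process/thread limit rather than running out of heap. A rough check, assuming a Linux host and that Elasticsearch runs as the elasticsearch user:

    # max user processes for the elasticsearch user (every JVM thread counts against this)
    sudo -u elasticsearch bash -c 'ulimit -u'

    # limits and live thread count of the running Elasticsearch process
    ES_PID=$(pgrep -f org.elasticsearch.bootstrap.Elasticsearch | head -n1)
    grep 'Max processes' /proc/$ES_PID/limits
    ps -o nlwp= -p $ES_PID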
I shut down all nodes except node3, and just now found that Java had used all available memory. I killed the process and reinstalled oracle-java on all nodes after purging the existing install. Should the cluster crash again, then in the absence of other suggestions I'm going to try downgrading Java.
Is there a way to tell what Lucene version each shard is using? I'm wondering if the background upgrade process somehow contributed to the problem, as new data was being written to the daily index during the rolling upgrade.
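For reference, the segments APIs appear to expose this per segment. Something like the following (assuming a node listening on localhost:9200) lists the Lucene version that wrote each segment of one daily index, per shard:

    # Lucene version per segment, per shard, for one daily index
    curl -s 'localhost:9200/_cat/segments/logstash-2017.07.09?v&h=index,shard,prirep,segment,version'

The same information is available as JSON from "GET /logstash-2017.07.09/_segments", where each segment carries a "version" field.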
What does your cluster topology look like? Do you only have 2 master-eligible nodes? Best practice is to have 3, as that allows the cluster to continue operating fully even if one of the master nodes goes down.
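Something like this (assuming the nodes are reachable on localhost:9200) will show each node's roles and which one is currently the elected master:

    # node.role lists m/d/i (master-eligible/data/ingest); the master column marks the elected master with *
    curl -s 'localhost:9200/_cat/nodes?v&h=name,node.role,master,heap.percent,ram.percent'

With three master-eligible nodes, discovery.zen.minimum_master_nodes should be set to 2 so that losing a single master-eligible node can't lead to a split brain.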
The cluster is restoring at the moment; I'll post an update in a few hours.
It's possible the way I upgraded Java contributed to the issue; I had a similar problem with Logstash. Debian lets you install a package over an existing package, which is what I did, assuming $magic would sort everything out. It turns out this is a bad idea: you should "apt purge $package" first.
Given the problems I had with Logstash, I'm wondering if Java was affected by the same mistake. I've since reinstalled the package on all nodes.
I didn't downgrade; I reinstalled 1.8.0_131-b11 after purging Java.
When I upgraded Java with "dpkg -i oracle-java8-jdk_8u131_amd64.deb", it was installed over the existing oracle-java8 install. The old and new class files ended up intermingled, which appears to have caused the JVM to fail at thread and memory management.
The fix was to purge Java from the system and reinstall.
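Roughly what that looked like, with the installed package name assumed from the .deb filename above (dpkg -l shows the exact name on each node):

    # confirm the exact package name first
    dpkg -l | grep -i -E 'oracle|java'

    # purge the old install completely, then install the new .deb cleanly
    sudo apt purge oracle-java8-jdk
    sudo dpkg -i oracle-java8-jdk_8u131_amd64.deb
    java -version   # should report 1.8.0_131-b11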
It seems I was wrong: the entire cluster crashed again.
Here is the error; again it starts with node1 failing. I think it crashed after the cluster rebalance finished. Yesterday the cluster had healed itself by rebalancing shards without node1, so after the full cluster restart the entire day was spent relocating shards back onto node1.
I'm assuming this error happened once that relocation completed. It seems to point at node1, but at the same time there's something to be said about the entire cluster crashing because of what happens on node1.
[2017-07-10T21:26:52,430][ERROR][o.e.i.e.Engine ] [node1] [logstash-2017.07.11][9] tragic event in index writer
org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:749) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:763) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1567) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1316) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:661) [elasticsearch-5.5.0.jar:5.5.0]
at org.elasticsearch.index.engine.InternalEngine.indexIntoLucene(InternalEngine.java:605) [elasticsearch-5.5.0.jar:5.5.0]
at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:505) [elasticsearch-5.5.0.jar:5.5.0]
at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:556) [elasticsearch-5.5.0.jar:5.5.0]
at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:545) [elasticsearch-5.5.0.jar:5.5.0]
at org.elasticsearch.action.bulk.TransportShardBulkAction.executeIndexRequestOnReplica(TransportShardBulkAction.java:441) [elasticsearch-5.5.0.jar:5.5.0]
at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnReplica(TransportShardBulkAction.java:376) [elasticsearch-5.5.0.jar:5.5.0]
at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnReplica(TransportShardBulkAction.java:69) [elasticsearch-5.5.0.jar:5.5.0]
at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncReplicaAction.onResponse(TransportReplicationAction.java:494) [elasticsearch-5.5.0.jar:5.5.0]
at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncReplicaAction.onResponse(TransportReplicationAction.java:467) [elasticsearch-5.5.0.jar:5.5.0]
at org.elasticsearch.index.shard.IndexShardOperationsLock.acquire(IndexShardOperationsLock.java:147) [elasticsearch-5.5.0.jar:5.5.0]
at org.elasticsearch.index.shard.IndexShard.acquireReplicaOperationLock(IndexShard.java:1673) [elasticsearch-5.5.0.jar:5.5.0]
at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncReplicaAction.doRun(TransportReplicationAction.java:566) [elasticsearch-5.5.0.jar:5.5.0]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.5.0.jar:5.5.0]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicaOperationTransportHandler.messageReceived(TransportReplicationAction.java:451) [elasticsearch-5.5.0.jar:5.5.0]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicaOperationTransportHandler.messageReceived(TransportReplicationAction.java:441) [elasticsearch-5.5.0.jar:5.5.0]
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) [elasticsearch-5.5.0.jar:5.5.0]
at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1544) [elasticsearch-5.5.0.jar:5.5.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) [elasticsearch-5.5.0.jar:5.5.0]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.5.0.jar:5.5.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
Caused by: java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method) ~[?:1.8.0_131]
at java.lang.Thread.start(Thread.java:717) ~[?:1.8.0_131]
at org.apache.lucene.index.ConcurrentMergeScheduler.merge(ConcurrentMergeScheduler.java:553) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.apache.lucene.index.IndexWriter.maybeMerge(IndexWriter.java:2083) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.apache.lucene.index.IndexWriter.doAfterSegmentFlushed(IndexWriter.java:5005) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.apache.lucene.index.DocumentsWriter$MergePendingEvent.process(DocumentsWriter.java:731) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.apache.lucene.index.IndexWriter.processEvents(IndexWriter.java:5043) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.apache.lucene.index.IndexWriter.processEvents(IndexWriter.java:5034) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1574) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
... 24 more
[2017-07-10T21:26:52,429][ERROR][o.e.i.e.Engine ] [node1] [logstash-2017.07.11][9] tragic event in index writer
org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
I am going to try downgrading Java to 8u112. If that doesn't solve the problem, I will consider our data lost and downgrade ES back to 5.4.0. When I do so I will use a different path.data and retain the previous indexes in case a fix arises in the next few weeks.
If this is identified as a fixable bug, we could switch versions to test a patch against the old data, and if it works I think adding the new path.data back would let the new data merge into the existing cluster.