Elasticsearch 5.5.0 cluster crash -- elasticsearch.yml processors set too high

Hello.

We hit a problem with ES 5.5.0 today, about 12 hours after upgrading from 5.2.3. The upgrade itself went fine and the cluster performed normally for many hours before the elected master crashed. Five hours later we restarted the failed node, and that immediately triggered an OOM crash on every node in the cluster.

01:27 -- node1 (elected master) left the cluster: failed to perform indices:data/write/bulk[s] on replica [logstash-2017.07.09][2] ... NodeDisconnectedException

01:27 -- cannot connect to node1. node3 elected as the new master.

05:21 -- restarted node1; master node3 left the cluster.
05:21 -- all nodes report "Force-closing a channel" errors and OOM.

I'm gathering logs now and will include them in a follow-up post.

These are the logs from when node1 crashed, before it was restarted.

Restarting node1 caused the rest of the cluster to crash immediately. All nodes were then restarted, after which node1 crashed again. The rest of the cluster is still up at the moment, with node1 left shut down.


[2017-07-09T01:27:24,748][ERROR][o.e.i.e.Engine ] [node1] [logstash-2017.07.09][3] tragic event in index writer
org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:749) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:763) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
(...)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
Caused by: java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method) ~[?:1.8.0_131]
at java.lang.Thread.start(Thread.java:717) ~[?:1.8.0_131]
at org.apache.lucene.index.ConcurrentMergeScheduler.merge(ConcurrentMergeScheduler.java:553) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.apache.lucene.index.IndexWriter.maybeMerge(IndexWriter.java:2083) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
(...)
at org.elasticsearch.index.IndexService$BaseAsyncTask.run(IndexService.java:743) ~[elasticsearch-5.5.0.jar:5.5.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) ~[elasticsearch-5.5.0.jar:5.5.0]
... 3 more

[2017-07-09T01:27:24,732][ERROR][o.e.i.e.Engine ] [node1] [logstash-2017.07.09][3] tragic event in index writer
org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:749) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:763) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
(...)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
Caused by: java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method) ~[?:1.8.0_131]
(...)

----- repeat errors -----

[2017-07-09T01:27:24,767][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [node1] fatal error in thread [elasticsearch[node1][bulk][T#297]], exiting
java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method) ~[?:1.8.0_131]
(...)
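That last OutOfMemoryError, "unable to create new native thread", usually means the JVM hit an OS-level limit on threads/processes rather than running out of heap. A rough way to check how close the Elasticsearch process is to those limits (a sketch; the pgrep pattern assumes a standard 5.x startup and may need adjusting):

```
# Find the Elasticsearch PID (pattern is an assumption; adjust for your install)
ES_PID=$(pgrep -f org.elasticsearch.bootstrap.Elasticsearch | head -n1)

# Per-process limit ("Max processes" also caps threads on Linux)
grep -i 'max processes' /proc/$ES_PID/limits

# Number of threads the process currently has
ls /proc/$ES_PID/task | wc -l

# System-wide ceilings
cat /proc/sys/kernel/threads-max /proc/sys/kernel/pid_max
```

If the live thread count is anywhere near the "Max processes" value, the real question becomes why Elasticsearch is trying to create that many threads in the first place.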


Correction: we upgraded from 5.4.0 to 5.5.0, and node1 was the first node to be upgraded.

Is it possible to delete a single replica? Maybe the jump to Lucene 6.6.0 somehow corrupted the index (or, in this case, that one shard).
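If it is, my understanding is that the reroute API's cancel command would drop just that copy and let it rebuild from the primary; something like this (a sketch, untested here, using the index/shard/node from the failure above):

```
# Drop the replica copy of shard 2 on node1 so it is re-replicated from the primary
curl -XPOST 'localhost:9200/_cluster/reroute' -H 'Content-Type: application/json' -d '
{
  "commands": [
    { "cancel": { "index": "logstash-2017.07.09", "shard": 2, "node": "node1", "allow_primary": false } }
  ]
}'
```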

What JVM and what OS?
Can you post the entire log from that node, and from the master, from around the time of the wider problems?

Java 1.8.0_131-b11. Debian Jessie.

There's a 7000-character limit here; where should I upload the entire log? Pastebin?

I left node1 shut down; after the cluster healed itself, it crashed again with the same symptoms.

Pastebin is fine.

Active master (node3) ____ https://filebin.ca/3SpwVlLwvn3C
Crashed master (node1) __ https://filebin.ca/3Spx1botlEUF

While you probably aren't being malicious, simply linking to a random file on the internet can come across as suspicious. It'd be great if you could pastebin them too :slight_smile:

That aside, are you using a custom path.data value?

Ok.

I had them confused: node1 was the first to crash and was also the active master at the time.

Node1 (active master) ___ https://pastebin.com/jNt2gDSe
Node3 (backup master) ___ https://pastebin.com/1R2ydQeB

Yes, it is a custom path.data value.
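For what it's worth, this is how I'd double-check that every node actually picked up the custom path (a sketch; assumes the local HTTP endpoint on port 9200):

```
# Show the path.* settings each node is actually running with
curl -s 'localhost:9200/_nodes/settings?filter_path=nodes.*.name,nodes.*.settings.path&pretty'
```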

I shut down all nodes except node3, and just now I found that Java had used all of the memory. I killed the process and reinstalled oracle-java on all nodes after purging the existing install. Should the cluster crash again, then in the absence of other suggestions I'm going to try downgrading Java.

Is there a way to tell which Lucene version each shard is using? I'm wondering whether the background segment upgrade somehow contributed to the problem, since new data was being written to the daily index during the rolling upgrade.
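My understanding is that the segments API exposes the Lucene version each segment was written with, so something like this should show whether any shard still carries segments from the older Lucene (a sketch):

```
# Count segments per Lucene version in one of the daily indices
curl -s 'localhost:9200/logstash-2017.07.09/_segments?pretty' | grep '"version"' | sort | uniq -c
```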

What does your cluster topology look like? Do you only have 2 master-eligible nodes? Best practice is to have 3, as that allows the cluster to continue operating fully even if one of the master nodes goes down.
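With 3 master-eligible nodes you would also want discovery.zen.minimum_master_nodes set to 2 to avoid split brain. It can go in elasticsearch.yml or be set at runtime, e.g. (a sketch against a local node):

```
# Quorum for 3 master-eligible nodes: (3 / 2) + 1 = 2
curl -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{ "persistent": { "discovery.zen.minimum_master_nodes": 2 } }'
```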

I edited my previous post to include some more info.

There are 8 nodes in total: 6 are data nodes and 2 are index/search nodes. 3 of the data nodes are master-eligible.

masters: {node1,node2,node3}
data: {node1,node2,node3,node4,node5,node6}
index/search: {index_node1, index_node2}

The problem isn't the topology in this case; it's that the master election process seems to be triggering a crash of every node.

The cluster is restoring at the moment, I'll post an update in a few hours.

It's possible the way I upgraded Java contributed to the issue; I had a similar problem with Logstash. Debian lets you install a package over an existing package, which is what I did, assuming $magic would sort it out. It turns out this is a bad idea: you should "apt purge $package" first.

Given the problems I had with Logstash, I'm wondering if the Java install was afflicted by the same mistake. I've since reinstalled the package on all nodes.
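Before reinstalling, dpkg can at least verify whether the installed files still match what the package shipped, which would confirm the "intermingled files" theory (a sketch; the package name is the one from our .deb and may differ elsewhere):

```
# Any output here means files on disk no longer match the package's md5sums
dpkg -V oracle-java8-jdk

# Which package owns the java binary that is actually being run?
dpkg -S "$(readlink -f "$(command -v java)")"
```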

I think it's the custom path.data; it's something we've seen in the past, from memory, but I'm pretty sure that was resolved?

If you're talking about 23981, I think maybe I was lucky to have skipped it.

I previously upgraded from 5.2 to 5.4, and that bug seems exclusive to 5.3.x.

The cluster has been stable for about 30 minutes, so I've resumed indexing; it seems it was the Java install. Thanks all for the help.

What JVM did you downgrade to?

I didn't downgrade; I reinstalled 1.8.0_131-b11 after purging Java.

When I upgraded Java using "dpkg -i oracle-java8-jdk_8u131_amd64.deb", it overwrote the existing install of oracle-java8. The old and new class files ended up intermingled in some chaotic way, which caused the JVM to fail at thread and memory management.

The fix was to purge Java from the system and reinstall.
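Roughly what that looked like, in case anyone hits the same thing (a sketch; package and .deb names are from our hosts and may differ on yours):

```
# See which Oracle Java packages are (partially) installed
dpkg -l | grep oracle-java

# Purge them completely, then do a clean reinstall and confirm the version
sudo apt-get purge -y oracle-java8-jdk
sudo apt-get autoremove -y
sudo dpkg -i oracle-java8-jdk_8u131_amd64.deb
java -version
```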


It seems I was wrong; the entire cluster crashed again.

Here is the error; again it starts with node1 failing. I think it crashed after the cluster rebalance finished. Yesterday the cluster had healed itself by balancing shards without node1, so after the full cluster restart the entire day was spent relocating shards back onto node1.

I'm assuming this error happens once that relocation completes. It seems to point at node1, but at the same time there's something to be said about the entire cluster crashing because of what happens on one node.
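To correlate the crash with the relocation finishing, the recovery and thread pool cat APIs can be polled while it runs (a sketch; localhost assumed):

```
# Shard recoveries/relocations that are still in progress
curl -s 'localhost:9200/_cat/recovery?v' | grep -v done

# Thread pool activity per node; rejections or very high active counts on
# node1 just before the crash would be telling
curl -s 'localhost:9200/_cat/thread_pool?v'
```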

[2017-07-10T21:26:52,430][ERROR][o.e.i.e.Engine ] [node1] [logstash-2017.07.11][9] tragic event in index writer
org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:749) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:763) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1567) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1316) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:661) [elasticsearch-5.5.0.jar:5.5.0]
at org.elasticsearch.index.engine.InternalEngine.indexIntoLucene(InternalEngine.java:605) [elasticsearch-5.5.0.jar:5.5.0]
at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:505) [elasticsearch-5.5.0.jar:5.5.0]
at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:556) [elasticsearch-5.5.0.jar:5.5.0]
at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:545) [elasticsearch-5.5.0.jar:5.5.0]
at org.elasticsearch.action.bulk.TransportShardBulkAction.executeIndexRequestOnReplica(TransportShardBulkAction.java:441) [elasticsearch-5.5.0.jar:5.5.0]
at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnReplica(TransportShardBulkAction.java:376) [elasticsearch-5.5.0.jar:5.5.0]
at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnReplica(TransportShardBulkAction.java:69) [elasticsearch-5.5.0.jar:5.5.0]
at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncReplicaAction.onResponse(TransportReplicationAction.java:494) [elasticsearch-5.5.0.jar:5.5.0]
at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncReplicaAction.onResponse(TransportReplicationAction.java:467) [elasticsearch-5.5.0.jar:5.5.0]
at org.elasticsearch.index.shard.IndexShardOperationsLock.acquire(IndexShardOperationsLock.java:147) [elasticsearch-5.5.0.jar:5.5.0]
at org.elasticsearch.index.shard.IndexShard.acquireReplicaOperationLock(IndexShard.java:1673) [elasticsearch-5.5.0.jar:5.5.0]
at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncReplicaAction.doRun(TransportReplicationAction.java:566) [elasticsearch-5.5.0.jar:5.5.0]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.5.0.jar:5.5.0]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicaOperationTransportHandler.messageReceived(TransportReplicationAction.java:451) [elasticsearch-5.5.0.jar:5.5.0]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicaOperationTransportHandler.messageReceived(TransportReplicationAction.java:441) [elasticsearch-5.5.0.jar:5.5.0]
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) [elasticsearch-5.5.0.jar:5.5.0]
at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1544) [elasticsearch-5.5.0.jar:5.5.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) [elasticsearch-5.5.0.jar:5.5.0]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.5.0.jar:5.5.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
Caused by: java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method) ~[?:1.8.0_131]
at java.lang.Thread.start(Thread.java:717) ~[?:1.8.0_131]
at org.apache.lucene.index.ConcurrentMergeScheduler.merge(ConcurrentMergeScheduler.java:553) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.apache.lucene.index.IndexWriter.maybeMerge(IndexWriter.java:2083) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.apache.lucene.index.IndexWriter.doAfterSegmentFlushed(IndexWriter.java:5005) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.apache.lucene.index.DocumentsWriter$MergePendingEvent.process(DocumentsWriter.java:731) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.apache.lucene.index.IndexWriter.processEvents(IndexWriter.java:5043) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.apache.lucene.index.IndexWriter.processEvents(IndexWriter.java:5034) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1574) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
... 24 more
[2017-07-10T21:26:52,429][ERROR][o.e.i.e.Engine ] [node1] [logstash-2017.07.11][9] tragic event in index writer
org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed

I am going to try downgrading Java to 8u112. If that doesn't solve the problem, I will consider our data lost and downgrade ES back to 5.4.0. When I do so I will use a different path.data and retain the previous indices, should a fix arise in the next few weeks.

If this is identified as a fixable bug, we could switch versions to test a patch against the old data, and if that works, I think also including the new path.data would allow the new data to merge into the existing cluster.

Can you post your config?
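Given the thread title, the part of the config worth looking at first is the processors setting: Elasticsearch derives its default thread pool sizes from that number, so setting it far above the real CPU count makes every node try to create far more threads than the OS allows, which matches the "unable to create new native thread" errors above. A quick comparison of the configured value against the real CPU count (a sketch; 5.x endpoints, localhost assumed):

```
# allocated_processors reflects the processors setting; available_processors
# is what the OS reports
curl -s 'localhost:9200/_nodes/os?filter_path=nodes.*.name,nodes.*.os.allocated_processors,nodes.*.os.available_processors&pretty'

# Thread pool sizes per node; inflated sizes across every pool point back at
# that setting
curl -s 'localhost:9200/_cat/thread_pool?v&h=node_name,name,size,active,rejected'
```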