org.elasticsearch.index.engine.EngineClosedException

Hello,

One week ago we expanded our production cluster (running ES 1.7.5) by adding 2 new data nodes to the 2 existing data nodes. We also have 3 indexing/search nodes.

After adding the 2 new nodes (which run exactly the same CentOS Linux setup as the existing ones) and after the shards rebalanced themselves, we started to monitor the nodes with the HQ plugin.
The 2 "old" nodes stayed stable but the 2 new ones started to behave differently:

and now in the log file we keep getting the exception below:

[2016-03-31 08:57:46,375][WARN ][indices.ttl ] [MSA-ES-CLUSTER_NODE_DATA4] failed to execute ttl purge
org.elasticsearch.index.engine.EngineClosedException: [ubilogs-15.3.2][4] CurrentState[CLOSED]
at org.elasticsearch.index.engine.Engine.ensureOpen(Engine.java:289)
at org.elasticsearch.index.engine.Engine.acquireSearcher(Engine.java:277)
at org.elasticsearch.index.shard.IndexShard.acquireSearcher(IndexShard.java:797)
at org.elasticsearch.index.shard.IndexShard.acquireSearcher(IndexShard.java:792)
at org.elasticsearch.indices.ttl.IndicesTTLService.purgeShards(IndicesTTLService.java:205)
at org.elasticsearch.indices.ttl.IndicesTTLService.access$000(IndicesTTLService.java:69)
at org.elasticsearch.indices.ttl.IndicesTTLService$PurgerThread.run(IndicesTTLService.java:141)

Any idea that could help?

Antoine

Just a follow-up:
we decided to use index.routing.allocation.include.tag and index.routing.allocation.exclude.tag to move all shards to the 2 data nodes that were not showing any errors. Fortunately, the unindexed logs were buffered in our log ingestion system, so we could index them (around 25 GB of data) and not lose any customer data.
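
For reference, the settings update we applied looked roughly like this (the tag values are placeholders for the node attribute we set in elasticsearch.yml on each node, and the index name is just one of ours):

curl -XPUT 'http://localhost:9200/ubilogs-15.3.2/_settings' -d '{
  "index.routing.allocation.include.tag": "healthy",
  "index.routing.allocation.exclude.tag": "failing"
}'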

Now our cluster is running with only 2 data nodes out of 4, and we would like to understand why the 2 new nodes started to fail.

A lot to look at.
The statement "High values indicate slow I/O" would make me look at the server hardware and OS first. (esp since your other nodes are ok with indices etc.)

I would start with the OS.
Run top and check RAM usage, and use iotop to look at I/O.

You may also want to run sar (from the sysstat package) on the server and see whether anything grows over time.
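
Something along these lines (the sar interval and count are arbitrary):

top                     # overall CPU and memory usage per process
iotop -o                # only show processes that are actually doing I/O
sar -u -r -b 5 12       # CPU, memory and disk I/O stats, every 5 seconds, 12 samples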

In addition, I would check that ES_HEAP_SIZE is set to half the machine's RAM.
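
With a package install that is usually set in /etc/sysconfig/elasticsearch (or /etc/default/elasticsearch on Debian); 16g here assumes a 32 GB machine, adjust for yours:

ES_HEAP_SIZE=16g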

Is the storage configured optimally, and is it configured the same way as on the other servers in the cluster?
For the NIC, check teaming, auto-negotiation, speed and duplex settings, etc. If the storage is on a SAN, is the HBA set up properly?
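
A quick way to compare the new nodes against the old ones (the interface and bond names are just examples):

ethtool eth0                    # negotiated speed and duplex
cat /proc/net/bonding/bond0     # teaming/bonding status, if bonding is used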

/etc/fstab settings can also cause I/O issues (search for "noatime in fstab" and I/O performance).
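
For example, a data volume mounted with noatime would look like this in /etc/fstab (device and mount point are placeholders):

/dev/sdb1  /var/lib/elasticsearch  ext4  defaults,noatime  0 2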

Hope this helps.

You are using TTL as well as time-based indices; why bother?

Well, actually we are not using time-based indices, which is why we rely on TTL.
But we are definitely going to change our strategy and move to time-based indices (most probably daily).
This should allow us to delete old data more efficiently and reduce the resources used for search by only querying the relevant indices.
Plus, TTL is deprecated as of ES 2.x...
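
With daily indices, purging old data becomes a single index delete, for example (the naming pattern is just an illustration):

curl -XDELETE 'http://localhost:9200/ubilogs-2016.03.01'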

That being said, I'm still trying to understand the reason behind the exception org.elasticsearch.index.engine.EngineClosedException.
It happened on only one of the four data nodes, but the whole indexing process was failing.

Antoine

It looks like that index was closed, hence the error.

Actually, when the error happened the index wasn't closed (at least it did not show as closed in plugins such as head or kopf).

Hello,

we did more investigation into the root cause of the issue, and it looks like we hit an OOM exception.
What I don't understand is that the circuit breaker was triggered, but the node still crashed.

Is there any best practice to detect or avoid this kind of situation?

Antoine

[2016-03-28 05:51:13,027][WARN ][indices.breaker ] [MSA-ES-CLUSTER_NODE_DATA4] [FIELDDATA] New used memory 9246158010 [8.6gb] from field [rawlog] would be larger than configured breaker: 8995025715 [8.3gb], breaking
[2016-03-28 05:51:26,455][WARN ][indices.breaker ] [MSA-ES-CLUSTER_NODE_DATA4] [FIELDDATA] New used memory 9412680187 [8.7gb] from field [rawlog] would be larger than configured breaker: 8995025715 [8.3gb], breaking
[2016-03-28 05:54:32,695][WARN ][transport.netty ] [MSA-ES-CLUSTER_NODE_DATA4] exception caught on transport layer [[id: 0x5c59bc0f, /10.250.132.220:55946 => /10.250.132.221:9300]], closing connection
java.lang.OutOfMemoryError: Java heap space
at org.elasticsearch.common.compress.BufferRecycler.allocInputBuffer(BufferRecycler.java:119)
at org.elasticsearch.common.compress.lzf.LZFCompressedStreamInput.<init>(LZFCompressedStreamInput.java:47)
at org.elasticsearch.common.compress.lzf.LZFCompressor.streamInput(LZFCompressor.java:121)
at org.elasticsearch.common.io.stream.CachedStreamInput.cachedHandlesCompressed(CachedStreamInput.java:69)
at org.elasticsearch.transport.netty.MessageChannelHandler.messageReceived(MessageChannelHandler.java:107)
at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:296)
at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:462)
at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:443)
at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:303)
at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
at org.elasticsearch.common.netty.OpenChannelsHandler.handleUpstream(OpenChannelsHandler.java:74)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:268)
at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:255)
at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
[2016-03-28 05:54:32,733][WARN ][index.shard ] [MSA-ES-CLUSTER_NODE_DATA4] [ubilogs-15.3.2][4] Failed to perform scheduled engine refresh
org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:700)
at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:714)
at org.apache.lucene.index.IndexWriter.nrtIsCurrent(IndexWriter.java:4716)
at org.apache.lucene.index.StandardDirectoryReader.isCurrent(StandardDirectoryReader.java:362)

Do you have any earlier OOMEs in your logs? Lucene's IndexWriter will defensively close itself if an OOME comes through it, so maybe that's how your IW was closed.

Maybe the breaker settings are too lenient? The goal of the circuit breaker is to prevent OOMEs from happening ...
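
For instance, on 1.7 you could tighten the fielddata breaker dynamically (the percentage is only an example and depends on how much fielddata you actually need):

curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "persistent": { "indices.breaker.fielddata.limit": "40%" }
}'

You can also cap the fielddata cache itself with indices.fielddata.cache.size in elasticsearch.yml, but that setting is static and requires a node restart.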