ES 5.2.2 data node crash (OutOfMemoryError)

(Yoni Farin) #1

Hi, I have 4 cold Elasticsearch data nodes that sometimes receive an OutOfMemoryError. It seems to be related to moving indices to these nodes. The cold nodes' spec is:
8 cores, 32 GB RAM (16 GB assigned to the JVM), 5 spinning disks of 2 TB each.
Each cold node currently has 3 TB of data and 800 shards.

From the memory dump, it seems like there are many threads stuck in:

elasticsearch[esdatacold2-prod][generic][T#9] [DAEMON] State: WAITING tid: 221
sun.misc.Unsafe.park(boolean, long)
java.util.concurrent.locks.LockSupport.parkNanos(Object, long)
java.util.concurrent.LinkedTransferQueue.awaitMatch(LinkedTransferQueue$Node, LinkedTransferQueue$Node, Object, boolean, long)
java.util.concurrent.LinkedTransferQueue.xfer(Object, boolean, int, long)
java.util.concurrent.LinkedTransferQueue.poll(long, TimeUnit)

(Christian Dahlqvist) #2

How much data does it hold? How many indices/shards? How much heap do you have assigned?

(Yoni Farin) #3

Thanks for the quick reply! Sorry for the missing information; I accidentally hit post too early :blush:
Adding the info now.

(Christian Dahlqvist) #4

How much data does each of the cold nodes hold? How many shards do you have on each of them?

(Yoni Farin) #5

Each cold node currently has 3 TB of data and 800 shards.

(Christian Dahlqvist) #6

Based on the general recommendations in this blog post, that sounds like a lot of shards given how much heap you have available on each node. In order to be able to hold more data on the cold nodes, you may want to use the shrink index API to reduce the number of shards before you relocate the indices to the cold nodes. You can also benefit from minimising the number of segments by forcemerging indices that are no longer written to.
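The suggestion above can be sketched with the REST API. The index name, shard count, and node name below are hypothetical, and the calls assume a local 5.x cluster on the default port:

```shell
# 1. Make the index read-only and collect a full copy of it on one node
#    (both are preconditions of the shrink API).
curl -X PUT "localhost:9200/logs-2017.03.01/_settings" -H 'Content-Type: application/json' -d'
{
  "index.blocks.write": true,
  "index.routing.allocation.require._name": "esdatacold2-prod"
}'

# 2. Shrink to a smaller number of primary shards (must be a factor of
#    the original shard count, e.g. 5 -> 1).
curl -X POST "localhost:9200/logs-2017.03.01/_shrink/logs-2017.03.01-shrunk" -H 'Content-Type: application/json' -d'
{
  "settings": { "index.number_of_shards": 1 }
}'

# 3. Force-merge the shrunken, no-longer-written index down to a single
#    segment per shard to minimise per-segment heap overhead.
curl -X POST "localhost:9200/logs-2017.03.01-shrunk/_forcemerge?max_num_segments=1"
```

Doing the shrink and force-merge on the hot tier before relocation keeps the heavy merge I/O off the spinning disks.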

(Lior Redlus) #7

Hi Christian,

I understand that the node may be holding more shards than recommended, but as the blog post you linked says, there is no fixed limit enforced by Elasticsearch. I can see why this would affect performance, but it should not cause a node to crash. Receiving an OutOfMemoryError means there is some fault in memory or thread-pool handling. The following is the full stack trace of the thread that received the exception:

elasticsearch[esdatacold2-prod][[transport_server_worker.default]][T#14] [DAEMON] <--- OutOfMemoryError happened in this thread State: RUNNABLE tid: 60
io.netty.buffer.PoolArena.allocateHuge(PooledByteBuf, int)
io.netty.buffer.PoolArena.allocate(PoolThreadCache, PooledByteBuf, int)
io.netty.buffer.PoolArena.reallocate(PooledByteBuf, int, boolean)
io.netty.buffer.AbstractByteBuf.writeBytes(ByteBuf, int, int)
io.netty.buffer.AbstractByteBuf.writeBytes(ByteBuf, int)
io.netty.handler.codec.ByteToMessageDecoder$1.cumulate(ByteBufAllocator, ByteBuf, ByteBuf)
io.netty.handler.codec.ByteToMessageDecoder.channelRead(ChannelHandlerContext, Object)
...
io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(ChannelHandlerContext, Object)
...
io.netty.channel.nio.NioEventLoop.processSelectedKey(SelectionKey, AbstractNioChannel)

A crashing node severely degrades performance across the whole cluster, and this should definitely be protected against.

Is there anything we can do about this issue?

(Christian Dahlqvist) #8

Do you have monitoring installed so you can see how the heap used varies over time?

What type of queries/aggregations/dashboards are you running against the data stored on these nodes?

The segments and shards held on a node use up heap space. The more you have, and the less optimised they are, the less heap is left over for querying. Elasticsearch has circuit breakers that try to stop requests that could cause an OOM, but they are apparently not able to catch all scenarios. If you want to reduce the chance of nodes going OOM, you may need to tune the breaker limits down. This will, however, cause more queries to fail instead.
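For reference, the circuit-breaker limits can be lowered via the cluster settings API. The percentages below are illustrative assumptions, not recommendations (the 5.x defaults are 70% of heap for the parent breaker and 60% each for fielddata and request):

```shell
# Tighten the circuit breakers so large requests trip the breaker
# (and fail) before they can exhaust the heap. Transient settings
# reset on full cluster restart.
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": {
    "indices.breaker.total.limit": "60%",
    "indices.breaker.request.limit": "40%",
    "indices.breaker.fielddata.limit": "40%"
  }
}'
```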

(Yoni Farin) #9

Looking at all the threads' call stacks in the dump, it seems the process wasn't doing anything related to queries (see here).
It seems instead to be related to the index rerouting that was taking place at the time the process crashed.

Briefly, we create a snapshot of each index to back up the data, and then move the indices to the cold nodes using shard allocation filtering by node name. The crash correlates with the time this script runs.
We have also experimented with the exact same architecture, only with SSD disks in the cold machines, and it works perfectly.
Only when moving a lot of shards to the HDD machines does the process crash with an out-of-memory error.
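For context, the move-to-cold step described above presumably looks something like the allocation-filter call below (the index and node names are hypothetical). Since the crashes correlate with relocations onto slow disks, throttling recovery traffic, as in the second call, may also reduce memory pressure during the move:

```shell
# Pin the index to a cold node by name; Elasticsearch then relocates
# its shards there in the background.
curl -X PUT "localhost:9200/logs-2017.03.01/_settings" -H 'Content-Type: application/json' -d'
{
  "index.routing.allocation.require._name": "esdatacold2-prod"
}'

# Throttle recoveries so the HDD-backed nodes receive fewer shards at
# once and at a lower rate (defaults: 40mb/sec, 2 concurrent per node).
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": {
    "indices.recovery.max_bytes_per_sec": "20mb",
    "cluster.routing.allocation.node_concurrent_recoveries": 1
  }
}'
```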

(system) #10

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.