ES 5.2.2 data node crash (OutOfMemoryError)

Hi, I have 4 cold Elasticsearch data nodes that sometimes hit an OutOfMemoryError. It seems to be related to moving indices to these nodes. The cold node spec is:
8 cores, 32 GB RAM (16 GB assigned to the JVM heap), 5 spinning disks of 2 TB each.
Each cold node currently has 3 TB of data and 800 shards.
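
For context, in Elasticsearch 5.x a 16 GB heap like ours would normally be pinned in config/jvm.options, roughly as follows (a sketch only; the exact path and values depend on the install):

# config/jvm.options -- heap fixed at 16 GB, min and max equal to avoid resize pauses
-Xms16g
-Xmx16g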

From the memory dump it looks like many threads are stuck in:

elasticsearch[esdatacold2-prod][generic][T#9] [DAEMON] State: WAITING tid: 221
sun.misc.Unsafe.park(boolean, long) Unsafe.java
java.util.concurrent.locks.LockSupport.parkNanos(Object, long) LockSupport.java:215
java.util.concurrent.LinkedTransferQueue.awaitMatch(LinkedTransferQueue$Node, LinkedTransferQueue$Node, Object, boolean, long) LinkedTransferQueue.java:734
java.util.concurrent.LinkedTransferQueue.xfer(Object, boolean, int, long) LinkedTransferQueue.java:647
java.util.concurrent.LinkedTransferQueue.poll(long, TimeUnit) LinkedTransferQueue.java:1277
java.util.concurrent.ThreadPoolExecutor.getTask() ThreadPoolExecutor.java:1073
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker) ThreadPoolExecutor.java:1134
java.util.concurrent.ThreadPoolExecutor$Worker.run() ThreadPoolExecutor.java:624
java.lang.Thread.run() Thread.java:748

How much data does it hold? How many indices/shards? How much heap do you have assigned?

Thanks for the quick reply! Sorry for the missing information, I accidentally hit post too early :blush:
Adding the info:

How much data does each of the cold nodes hold? How many shards do you have on each of them?

Each cold node currently has 3 TB of data and 800 shards.

Based on the general recommendations in this blog post, that sounds like a lot of shards given how much heap you have available on each node. In order to be able to hold more data on the cold nodes, you may want to use the shrink index API to reduce the number of shards before you relocate the indices to the cold nodes. You can also benefit from minimising the number of segments by forcemerging indices that are no longer written to.
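
As a rough illustration of both suggestions (the index name, target shard count and node name below are placeholders, and the source index must be made read-only with a copy of every shard on one node before it can be shrunk):

# Prepare the source index for shrinking: pin a copy of every shard to one node and block writes
PUT logs-2018.01.01/_settings
{
  "settings": {
    "index.routing.allocation.require._name": "esdatacold2-prod",
    "index.blocks.write": true
  }
}

# Shrink into a new index with a single primary shard
POST logs-2018.01.01/_shrink/logs-2018.01.01-shrunk
{
  "settings": {
    "index.number_of_shards": 1
  }
}

# Force-merge an index that is no longer written to, to reduce the segment count
POST logs-2018.01.01-shrunk/_forcemerge?max_num_segments=1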

Hi Christian,

I understand that the node may be holding more shards than recommended, but, as the blog post you linked says, there is no fixed limit enforced by Elasticsearch. I can see why this would affect performance, but it should not cause a node to crash. Receiving an OutOfMemoryError means there is some fault in memory / thread-pool handling. The following is the full stack trace of the thread that hit the exception:

elasticsearch[esdatacold2-prod][[transport_server_worker.default]][T#14] [DAEMON] <--- OutOfMemoryError happened in this thread State: RUNNABLE tid: 60
java.lang.OutOfMemoryError.<init>() OutOfMemoryError.java:48
io.netty.buffer.PoolArena$HeapArena.newUnpooledChunk(int) PoolArena.java:661
io.netty.buffer.PoolArena.allocateHuge(PooledByteBuf, int) PoolArena.java:246
io.netty.buffer.PoolArena.allocate(PoolThreadCache, PooledByteBuf, int) PoolArena.java:224
io.netty.buffer.PoolArena.reallocate(PooledByteBuf, int, boolean) PoolArena.java:372
io.netty.buffer.PooledByteBuf.capacity(int) PooledByteBuf.java:120
io.netty.buffer.AbstractByteBuf.ensureWritable0(int) AbstractByteBuf.java:284
io.netty.buffer.AbstractByteBuf.ensureWritable(int) AbstractByteBuf.java:265
io.netty.buffer.AbstractByteBuf.writeBytes(ByteBuf, int, int) AbstractByteBuf.java:1068
io.netty.buffer.AbstractByteBuf.writeBytes(ByteBuf, int) AbstractByteBuf.java:1060
io.netty.buffer.AbstractByteBuf.writeBytes(ByteBuf) AbstractByteBuf.java:1050
io.netty.handler.codec.ByteToMessageDecoder$1.cumulate(ByteBufAllocator, ByteBuf, ByteBuf) ByteToMessageDecoder.java:92
io.netty.handler.codec.ByteToMessageDecoder.channelRead(ChannelHandlerContext, Object) ByteToMessageDecoder.java:246
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(Object) AbstractChannelHandlerContext.java:363
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext, Object) AbstractChannelHandlerContext.java:349
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(Object) AbstractChannelHandlerContext.java:341
io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelHandlerContext, Object) ChannelInboundHandlerAdapter.java:86
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(Object) AbstractChannelHandlerContext.java:363
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext, Object) AbstractChannelHandlerContext.java:349
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(Object) AbstractChannelHandlerContext.java:341
io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(ChannelHandlerContext, Object) DefaultChannelPipeline.java:1334
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(Object) AbstractChannelHandlerContext.java:363
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext, Object) AbstractChannelHandlerContext.java:349
io.netty.channel.DefaultChannelPipeline.fireChannelRead(Object) DefaultChannelPipeline.java:926
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read() AbstractNioByteChannel.java:129
io.netty.channel.nio.NioEventLoop.processSelectedKey(SelectionKey, AbstractNioChannel) NioEventLoop.java:642
io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(Set) NioEventLoop.java:527
io.netty.channel.nio.NioEventLoop.processSelectedKeys() NioEventLoop.java:481
io.netty.channel.nio.NioEventLoop.run() NioEventLoop.java:441
io.netty.util.concurrent.SingleThreadEventExecutor$5.run() SingleThreadEventExecutor.java:858
java.lang.Thread.run() Thread.java:748

A crashing node has severe performance effects on the whole cluster, and this should definitely be protected against.

Is there anything we can do about this issue?


Do you have monitoring installed so you can see how the heap used varies over time?
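
If you don't have monitoring set up, a quick way to spot-check heap usage per node is the cat nodes API, e.g.:

GET _cat/nodes?v&h=name,heap.percent,heap.current,heap.max,ram.percent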

What type of queries/aggregations/dashboards are you running against the data stored on these nodes?

The segments and shards held on a node use up heap space. The more you have, and the less optimised they are, the less heap is left over for querying. Elasticsearch has circuit breakers that try to stop requests that could cause an OOM, but they are apparently not able to catch every scenario. If you want to reduce the chance of nodes going OOM, you may need to tune the breakers to trip earlier; this will, however, lead to some queries failing instead.
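
For example, the parent and request circuit breaker limits can be lowered dynamically via the cluster settings API (the percentages below are purely illustrative, not recommendations):

PUT _cluster/settings
{
  "transient": {
    "indices.breaker.total.limit": "60%",
    "indices.breaker.request.limit": "40%"
  }
}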

Looking at all the thread call stacks in the dump, it seems the process wasn't doing anything related to queries (see here).
The crash seems to be related to the index rerouting that was taking place at the time the process died.

Briefly, we take a snapshot of each index to back up the data, and then move the indices to these cold nodes using shard allocation filtering by node name. The crash correlates with the time this script runs.
We have also experimented with the exact same architecture, only with SSD disks in the cold machines, and that works perfectly.
Only when moving a lot of shards to these HDD machines does the process crash with an out-of-memory error.
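
For reference, the move itself is just a per-index allocation-filter settings update after the snapshot, roughly like this (repository, snapshot, index and node names are placeholders for what the script actually uses):

# Snapshot the index first
PUT _snapshot/backup_repo/snap-logs-2018.01.01?wait_for_completion=true
{
  "indices": "logs-2018.01.01"
}

# Then require the index to live on a cold node, which triggers the relocation
PUT logs-2018.01.01/_settings
{
  "index.routing.allocation.require._name": "esdatacold2-prod"
}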
