Hi Team,
We have an Elasticsearch cluster with 4 nodes, each node having ~16 GB of total memory.
ES service memory is assigned as below -
export ES_HEAP_SIZE=10180m
export ES_JAVA_OPTS="-Xms10180m -Xmx10180m"
ES version - 6.4.1
Total disk for cluster - 4 TB
There are a total of ~1724 indices, each assigned 5 primary shards and 1 set of replicas.
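For reference, the per-node shard and disk distribution can be checked with commands like the following (same 127.0.0.1:8080 endpoint as the health call further below):

# shards, disk used and disk available per node
curl "127.0.0.1:8080/_cat/allocation?v"
# per-index shard count and store size, largest first
curl "127.0.0.1:8080/_cat/indices?v&s=store.size:desc"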
For the past couple of months, the ES process on the nodes has been intermittently shutting down with out-of-memory errors; the detailed stack trace is below:
[2019-05-12T11:57:54,687][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [***] fatal error in thread [Thread-4117498], exiting
java.lang.OutOfMemoryError: Java heap space
at io.netty.util.internal.PlatformDependent.allocateUninitializedArray(PlatformDependent.java:200) ~[?:?]
at io.netty.buffer.PoolArena$HeapArena.newByteArray(PoolArena.java:676) ~[?:?]
at io.netty.buffer.PoolArena$HeapArena.newChunk(PoolArena.java:686) ~[?:?]
at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:244) ~[?:?]
at io.netty.buffer.PoolArena.allocate(PoolArena.java:226) ~[?:?]
at io.netty.buffer.PoolArena.reallocate(PoolArena.java:397) ~[?:?]
at io.netty.buffer.PooledByteBuf.capacity(PooledByteBuf.java:118) ~[?:?]
at io.netty.buffer.AbstractByteBuf.ensureWritable0(AbstractByteBuf.java:285) ~[?:?]
at io.netty.buffer.AbstractByteBuf.ensureWritable(AbstractByteBuf.java:265) ~[?:?]
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1077) ~[?:?]
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1070) ~[?:?]
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1060) ~[?:?]
at io.netty.handler.codec.ByteToMessageDecoder$1.cumulate(ByteToMessageDecoder.java:92) ~[?:?]
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:263) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) ~[?:?]
at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) ~[?:?]
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1359) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) ~[?:?]
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:935) ~[?:?]
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:134) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:545) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:499) ~[?:?]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459) ~[?:?]
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) ~[?:?]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_111]
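When a node starts heading toward this state, heap pressure can be checked with something like the command below (the column selection is just what seems useful to watch, not taken from the logs above):

curl "127.0.0.1:8080/_cat/nodes?v&h=name,node.role,heap.percent,heap.max,ram.percent"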
As a workaround we have been deleting the replica shards, which turns the cluster health back to green.
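The replica removal is done roughly along these lines, via a settings update (it drops replicas for all indices at once; 6.x requires the Content-Type header):

curl -XPUT -H 'Content-Type: application/json' "127.0.0.1:8080/_all/_settings" -d '{"index.number_of_replicas": 0}'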
There is a script that deletes indices older than 30 days, and we often see this issue when the shard count is higher.
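For context, a minimal sketch of such a date-based cleanup, assuming an illustrative logstash-YYYY.MM.DD naming pattern (not our exact index names) and GNU date, looks like this:

#!/bin/bash
# delete the index that is exactly 30 days old (index name pattern is illustrative)
OLD_INDEX="logstash-$(date -d '30 days ago' +%Y.%m.%d)"
curl -XDELETE "127.0.0.1:8080/${OLD_INDEX}"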
Could you please suggest whether this is mainly due to the high number of shards, or is there any other fine-tuning we can do to resolve the issue?
Output of the health command for reference, taken when the cluster health is red:
[ ~]$ curl 127.0.0.1:8080/_cat/health?v
epoch timestamp cluster status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1559019521 21:58:41 digger red 2 2 7226 5463 0 4 4934 4 2.5s 59.4%
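To dig into why shards remain unassigned when the cluster is red, the standard allocation APIs can be queried, for example:

# explanation for the first unassigned shard found
curl "127.0.0.1:8080/_cluster/allocation/explain?pretty"
# list unassigned shards along with the reason they are unassigned
curl "127.0.0.1:8080/_cat/shards?h=index,shard,prirep,state,unassigned.reason" | grep UNASSIGNED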