Elasticsearch (6.4.1) - JVM OutOfMemoryError

Hi Team,

We have an Elasticsearch cluster with 4 nodes; each node has ~16 GB of total memory.

ES service memory is assigned as below:
export ES_HEAP_SIZE=10180m
export ES_JAVA_OPTS="-Xms10180m -Xmx10180m"
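
For reference, the equivalent heap settings in config/jvm.options on 6.x would look like the sketch below (the exact file location depends on the install method):

# config/jvm.options (e.g. /etc/elasticsearch/jvm.options for package installs)
-Xms10180m
-Xmx10180m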

ES version - 6.4.1
Disk Total for cluster - 4TB
There are ~1,724 indices in total, each assigned 5 primary shards and 1 replica.

For the past couple of months, the ES process on the nodes has intermittently shut down with out-of-memory errors. The detailed stack trace is below:

[2019-05-12T11:57:54,687][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [***] fatal error in thread [Thread-4117498], exiting
java.lang.OutOfMemoryError: Java heap space
        at io.netty.util.internal.PlatformDependent.allocateUninitializedArray(PlatformDependent.java:200) ~[?:?]
        at io.netty.buffer.PoolArena$HeapArena.newByteArray(PoolArena.java:676) ~[?:?]
        at io.netty.buffer.PoolArena$HeapArena.newChunk(PoolArena.java:686) ~[?:?]
        at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:244) ~[?:?]
        at io.netty.buffer.PoolArena.allocate(PoolArena.java:226) ~[?:?]
        at io.netty.buffer.PoolArena.reallocate(PoolArena.java:397) ~[?:?]
        at io.netty.buffer.PooledByteBuf.capacity(PooledByteBuf.java:118) ~[?:?]
        at io.netty.buffer.AbstractByteBuf.ensureWritable0(AbstractByteBuf.java:285) ~[?:?]
        at io.netty.buffer.AbstractByteBuf.ensureWritable(AbstractByteBuf.java:265) ~[?:?]
        at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1077) ~[?:?]
        at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1070) ~[?:?]
        at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1060) ~[?:?]
        at io.netty.handler.codec.ByteToMessageDecoder$1.cumulate(ByteToMessageDecoder.java:92) ~[?:?]
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:263) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) ~[?:?]
        at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) ~[?:?]
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1359) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) ~[?:?]
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:935) ~[?:?]
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:134) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:545) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:499) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459) ~[?:?]
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) ~[?:?]
        at java.lang.Thread.run(Thread.java:745) [?:1.8.0_111]

As a workaround we have been deleting the replica shards, which turns the health back to green.
There is a script that deletes indices older than 30 days, and we often see this issue when there are more shards in the cluster.
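
For illustration, the workaround amounts to something like the following against the same 127.0.0.1:8080 endpoint used in the health command below (the index name is just an example):

# Drop replicas on all indices
curl -XPUT '127.0.0.1:8080/_all/_settings' -H 'Content-Type: application/json' -d '{"index": {"number_of_replicas": 0}}'

# Delete an index older than 30 days (example name)
curl -XDELETE '127.0.0.1:8080/logs-2019.04.10'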

Could you please suggest whether this is mainly due to the high number of shards, or is there any other fine-tuning we can do to resolve the issue?

Health command output for reference when the cluster health is red:

[ ~]$ curl 127.0.0.1:8080/_cat/health?v
epoch      timestamp cluster status node.total node.data shards  pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1559019521 21:58:41  digger  red             2         2   7226 5463    0    4     4934             4               2.5s                 59.4%
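
For additional context, per-node shard counts and heap pressure can be pulled from the _cat APIs on the same endpoint, e.g.:

curl '127.0.0.1:8080/_cat/allocation?v'
curl '127.0.0.1:8080/_cat/nodes?v&h=name,heap.percent,heap.max'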

Welcome!

The short answer is that you are oversharded. Look at using _shrink to reduce the number of shards, and then update your templates so new indices are created with fewer shards.
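
For example, the _shrink flow on 6.x looks roughly like this (the index and node names here are just placeholders; the source index has to be read-only, with a copy of every shard on one node, before shrinking):

# Make the source index read-only and co-locate its shards on one node
curl -XPUT '127.0.0.1:8080/logs-2019.05.01/_settings' -H 'Content-Type: application/json' -d '
{
  "settings": {
    "index.routing.allocation.require._name": "node-1",
    "index.blocks.write": true
  }
}'

# Shrink 5 primaries down to 1
curl -XPOST '127.0.0.1:8080/logs-2019.05.01/_shrink/logs-2019.05.01-shrunk' -H 'Content-Type: application/json' -d '
{
  "settings": {
    "index.number_of_shards": 1,
    "index.number_of_replicas": 1
  }
}'

The target shard count has to be a factor of the source's, so 5 primaries can shrink to 1.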

Thanks for the very quick reply!
I will explore the _shrink option as suggested.

Could you please suggest approximately how many shards a cluster of 4 nodes, with 10 GB of heap allocated to each node, should have?
Is there any ideal configuration for number of shards vs. memory allocated?

A few hundred at most.

There are a few blog posts on this that can help you; check them out.

For most use cases with time-based indices, 10GB is a good minimum shard size to aim for. Given that you have 4TB of data, that would mean around 400 shards in total, i.e. ~100 per node.
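
As a sketch, a 6.x index template along these lines would make new indices come up with a single primary shard (the template name and pattern are placeholders to adapt to your naming scheme):

curl -XPUT '127.0.0.1:8080/_template/logs' -H 'Content-Type: application/json' -d '
{
  "index_patterns": ["logs-*"],
  "settings": {
    "index.number_of_shards": 1,
    "index.number_of_replicas": 1
  }
}'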
