Hi Team,
We have an Elasticsearch cluster with 4 nodes, each node having ~16 GB of total memory.
ES service memory is assigned as below -
export ES_HEAP_SIZE=10180m
export ES_JAVA_OPTS="-Xms10180m -Xmx10180m"
ES version - 6.4.1
Total disk for cluster - 4 TB
There are a total of ~1724 indices, each assigned 5 primary shards and 1 set of replicas.
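For reference, the per-node shard and disk distribution can be checked with commands like the following (same 127.0.0.1:8080 endpoint as the health call further below):

# shards, disk used and disk available per node
curl "127.0.0.1:8080/_cat/allocation?v"
# per-index shard count and store size, largest first
curl "127.0.0.1:8080/_cat/indices?v&s=store.size:desc"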
For the past couple of months, the ES process on the nodes has been intermittently shutting down with out-of-memory errors; the detailed stack trace is below:
[2019-05-12T11:57:54,687][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [***] fatal error in thread [Thread-4117498], exiting
java.lang.OutOfMemoryError: Java heap space
at io.netty.util.internal.PlatformDependent.allocateUninitializedArray(PlatformDependent.java:200) ~[?:?]
at io.netty.buffer.PoolArena$HeapArena.newByteArray(PoolArena.java:676) ~[?:?]
at io.netty.buffer.PoolArena$HeapArena.newChunk(PoolArena.java:686) ~[?:?]
at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:244) ~[?:?]
at io.netty.buffer.PoolArena.allocate(PoolArena.java:226) ~[?:?]
at io.netty.buffer.PoolArena.reallocate(PoolArena.java:397) ~[?:?]
at io.netty.buffer.PooledByteBuf.capacity(PooledByteBuf.java:118) ~[?:?]
at io.netty.buffer.AbstractByteBuf.ensureWritable0(AbstractByteBuf.java:285) ~[?:?]
at io.netty.buffer.AbstractByteBuf.ensureWritable(AbstractByteBuf.java:265) ~[?:?]
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1077) ~[?:?]
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1070) ~[?:?]
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1060) ~[?:?]
at io.netty.handler.codec.ByteToMessageDecoder$1.cumulate(ByteToMessageDecoder.java:92) ~[?:?]
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:263) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) ~[?:?]
at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) ~[?:?]
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1359) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) ~[?:?]
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:935) ~[?:?]
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:134) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:545) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:499) ~[?:?]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459) ~[?:?]
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) ~[?:?]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_111]
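When a node starts heading toward this state, heap pressure can be checked with something like the command below (the column selection is just what seems useful to watch, not taken from the logs above):

curl "127.0.0.1:8080/_cat/nodes?v&h=name,node.role,heap.percent,heap.max,ram.percent"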
As a workaround we have been deleting the replica shards, which turns the cluster health back to green.
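The replica removal is done roughly along these lines, via a settings update (it drops replicas for all indices at once; 6.x requires the Content-Type header):

curl -XPUT -H 'Content-Type: application/json' "127.0.0.1:8080/_all/_settings" -d '{"index.number_of_replicas": 0}'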
There is a script that deletes indices older than 30 days, and we often see this issue when the shard count is higher.
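For context, a minimal sketch of such a date-based cleanup, assuming an illustrative logstash-YYYY.MM.DD naming pattern (not our exact index names) and GNU date, looks like this:

#!/bin/bash
# delete the index that is exactly 30 days old (index name pattern is illustrative)
OLD_INDEX="logstash-$(date -d '30 days ago' +%Y.%m.%d)"
curl -XDELETE "127.0.0.1:8080/${OLD_INDEX}"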
Could you please suggest whether this is mainly due to the high number of shards, or is there any other fine-tuning we can do to resolve the issue?
Output of the health command for reference, taken when the cluster health is red:
[ ~]$ curl 127.0.0.1:8080/_cat/health?v
epoch timestamp cluster status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1559019521 21:58:41 digger red 2 2 7226 5463 0 4 4934 4 2.5s 59.4%
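To dig into why shards remain unassigned when the cluster is red, the standard allocation APIs can be queried, for example:

# explanation for the first unassigned shard found
curl "127.0.0.1:8080/_cluster/allocation/explain?pretty"
# list unassigned shards along with the reason they are unassigned
curl "127.0.0.1:8080/_cat/shards?h=index,shard,prirep,state,unassigned.reason" | grep UNASSIGNED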