[ Frequent OOME on coordinator node ]

Hi,
For about five days now, both of my coordinating nodes have been crashing frequently with an OOME.

I have tried to gather some useful information to help investigate the problem, but I can't figure out what is wrong.

I'm not really sure what to check first, so I'm trying to find some help here :slight_smile:
It's strange because I couldn't find any correlating change in the configuration or in the data volume (though maybe I missed it).

Here is some information to help with the investigation:

Problem:
My 2 coordinator nodes are crashing with an OOME.

Logs:
[2018-04-17T15:14:51,535][WARN ][i.n.c.AbstractChannelHandlerContext] An exception 'java.lang.OutOfMemoryError: Java heap space' [enable DEBUG level for full stacktrace] was thrown by a user handler's exceptionCaught() method while handling the following exception:
java.lang.OutOfMemoryError: Java heap space
[2018-04-17T15:14:53,038][ERROR][o.e.x.m.c.n.NodeStatsCollector] [coordinator1] collector [node_stats] timed out when collecting data
[2018-04-17T15:14:46,205][WARN ][i.n.c.AbstractChannelHandlerContext] An exception 'java.lang.OutOfMemoryError: Java heap space' [enable DEBUG level for full stacktrace] was thrown by a user handler's exceptionCaught() method while handling the following exception:
java.lang.OutOfMemoryError: Java heap space
[2018-04-17T15:14:58,029][ERROR][o.e.t.n.Netty4Utils ] fatal error on the network layer
at org.elasticsearch.transport.netty4.Netty4Utils.maybeDie(Netty4Utils.java:185)
at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.exceptionCaught(Netty4MessageChannelHandler.java:73)
at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:285)
at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:264)
at io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:256)
at io.netty.channel.ChannelInboundHandlerAdapter.exceptionCaught(ChannelInboundHandlerAdapter.java:131)
at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:285)
at io.netty.channel.AbstractChannelHandlerContext.notifyHandlerException(AbstractChannelHandlerContext.java:850)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:364)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:86)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:134)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:544)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:498)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at java.base/java.lang.Thread.run(Thread.java:844)
[2018-04-17T15:14:48,801][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [coordinator1] fatal error in thread [Thread-7], exiting
java.lang.OutOfMemoryError: Java heap space

Metrics:

  • Indices: 2093 (daily indices)
  • Total shards: 4184
  • ES version: 6.1.1
  • 3 master nodes: 4 CPU / 8 GB RAM
  • 4 data nodes: 16 CPU / 32 GB RAM
  • 2 coordinator nodes: 2 CPU / 4 GB RAM
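
For reference, figures like these can be pulled straight from the cluster (the cat and cluster APIs shown here are standard Elasticsearch endpoints):

    # Cluster-wide index/shard totals and overall health
    curl -s 'localhost:9200/_cluster/health?pretty'
    # Shard and disk distribution per data node
    curl -s 'localhost:9200/_cat/allocation?v'
    # Heap and RAM usage per node
    curl -s 'localhost:9200/_cat/nodes?v&h=name,node.role,heap.percent,ram.percent'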

Heap dump analysis:

The heap dump analysis looks much the same on both coordinator nodes.
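
For anyone wanting to reproduce the dumps: the JVM writes them itself on OOME, since the Elasticsearch distribution ships with the heap-dump flag enabled in jvm.options (the path below is illustrative, not my actual setting):

    # jvm.options (enabled by default in the Elasticsearch distribution)
    -XX:+HeapDumpOnOutOfMemoryError
    # optional: control where the dumps land (path is illustrative)
    -XX:HeapDumpPath=/var/lib/elasticsearch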

Thank you for your help
Best regards
Jérôme

Which version is it?

4184 shards for 4 data nodes seems a bit too much; that's over 1000 shards per data node.
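
One common way to bring that down for daily indices is to lower the shard count in the index template. A sketch (the template name and index pattern here are illustrative, not taken from your cluster):

    curl -s -X PUT 'localhost:9200/_template/daily-logs' \
      -H 'Content-Type: application/json' -d '
    {
      "index_patterns": ["logs-*"],
      "settings": { "number_of_shards": 1, "number_of_replicas": 1 }
    }'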

Hi,

The ES version is: 6.1.1

Could you upgrade?

OK, I will upgrade to the latest version and keep this topic up to date.

Thanks for your help

Regards

Just to keep you up to date:

I have rolling-upgraded my ES cluster to 6.2.4.
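
For anyone following along, each node went through the usual rolling-upgrade steps, roughly like this (a sketch; the service name and the package upgrade step are illustrative):

    # 1. Stop shard allocation so nothing rebalances while the node is down
    curl -s -X PUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' \
      -d '{"transient": {"cluster.routing.allocation.enable": "none"}}'
    # 2. Optional synced flush to speed up shard recovery afterwards
    curl -s -X POST 'localhost:9200/_flush/synced'
    # 3. Stop the node, upgrade the package, restart it (service name is illustrative)
    systemctl stop elasticsearch
    systemctl start elasticsearch
    # 4. Re-enable allocation and wait for green before moving to the next node
    curl -s -X PUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' \
      -d '{"transient": {"cluster.routing.allocation.enable": null}}'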

The operation finished at ~01:30 AM.

For now, it seems the heap on my coordinator nodes is doing well:

[screenshot: coordinator node heap usage]
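
For reference, the same figure can also be read straight from the node stats API:

    curl -s 'localhost:9200/_nodes/stats/jvm?filter_path=nodes.*.name,nodes.*.jvm.mem.heap_used_percent&pretty'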

I will post an update tomorrow.

Best regards

Hi,

Since the upgrade, the heap on my coordinator nodes has been much more stable.

[screenshot: coordinator node heap usage since the upgrade]

So I think we can consider this issue closed.

Many thanks for your help

Best regards
Jérôme

Great.