[ Frequent OOME on coordinator node ]

Hi,
For about 5 days now, both of my coordinating nodes have been crashing frequently with an OOME.

I have tried to gather some useful information to help investigate the problem, but I can't figure out what is wrong.

I'm not really sure what to check first, so I'm trying to find some help here :slight_smile:
It's strange, because I couldn't find any correlated change in the configuration or in the data volume (maybe I missed it, though).

Here is some information to help with the investigation:

Problem:
My 2 coordinator nodes are crashing with an OOME.

Logs:
[2018-04-17T15:14:51,535][WARN ][i.n.c.AbstractChannelHandlerContext] An exception 'java.lang.OutOfMemoryError: Java heap space' [enable DEBUG level for full stacktrace] was thrown by a user handler's exceptionCaught() method while handling the following exception:
java.lang.OutOfMemoryError: Java heap space
[2018-04-17T15:14:53,038][ERROR][o.e.x.m.c.n.NodeStatsCollector] [coordinator1] collector [node_stats] timed out when collecting data
[2018-04-17T15:14:46,205][WARN ][i.n.c.AbstractChannelHandlerContext] An exception 'java.lang.OutOfMemoryError: Java heap space' [enable DEBUG level for full stacktrace] was thrown by a user handler's exceptionCaught() method while handling the following exception:
java.lang.OutOfMemoryError: Java heap space
[2018-04-17T15:14:58,029][ERROR][o.e.t.n.Netty4Utils ] fatal error on the network layer
at org.elasticsearch.transport.netty4.Netty4Utils.maybeDie(Netty4Utils.java:185)
at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.exceptionCaught(Netty4MessageChannelHandler.java:73)
at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:285)
at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:264)
at io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:256)
at io.netty.channel.ChannelInboundHandlerAdapter.exceptionCaught(ChannelInboundHandlerAdapter.java:131)
at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:285)
at io.netty.channel.AbstractChannelHandlerContext.notifyHandlerException(AbstractChannelHandlerContext.java:850)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:364)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:86)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:134)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:544)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:498)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at java.base/java.lang.Thread.run(Thread.java:844)
[2018-04-17T15:14:48,801][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [coordinator1] fatal error in thread [Thread-7], exiting
java.lang.OutOfMemoryError: Java heap space

Metrics:

  • Indices: 2093 (one index per day)
  • Total shards: 4184
  • ES version: 6.1.1
  • 3 master nodes: 4 CPU / 8 GB RAM
  • 4 data nodes: 16 CPU / 32 GB RAM
  • 2 coordinator nodes: 2 CPU / 4 GB RAM
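For reference, the index and shard counts above can be checked directly against the _cat APIs. A minimal sketch in Python (hypothetical host, no authentication shown):

```python
import requests

ES = "http://localhost:9200"  # placeholder: one of the coordinator nodes

# Index and shard counts, the same figures as in the list above.
indices = requests.get(f"{ES}/_cat/indices?format=json").json()
shards = requests.get(f"{ES}/_cat/shards?format=json").json()
print("indices:", len(indices), "shards:", len(shards))
```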

Heap dump analysis:

The heap dump analysis looks essentially the same on both coordinator nodes.

Thank you for your help
Best regards
Jérôme

Which version is it?

4184 shards for 4 data nodes seems a bit too much.
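To put rough numbers on it, using the often-quoted guideline of roughly 20 shards per GB of heap (a rule of thumb, not a hard limit) and assuming each data node runs a ~16 GB heap, i.e. half of its 32 GB of RAM:

```python
# Back-of-the-envelope check of the shard count against the commonly cited
# ~20-shards-per-GB-of-heap guideline (assumed heap size: half of the 32 GB RAM).
total_shards = 4184
data_nodes = 4
heap_per_data_node_gb = 16

shards_per_node = total_shards / data_nodes     # ~1046 shards per node
suggested_max = 20 * heap_per_data_node_gb      # ~320 shards per node

print(f"{shards_per_node:.0f} shards/node vs ~{suggested_max} suggested")
```

So each data node carries roughly three times more shards than that guideline suggests, which tends to add per-shard and cluster-state overhead across the whole cluster, coordinators included.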

Hi,

The ES version is: 6.1.1

Could you upgrade?

OK, I will upgrade to the latest version and keep this topic up to date.

Thanks for your help

Regards

Just to keep you up to date:

I have rolling-upgraded my ES cluster to 6.2.4.

The operation finished at around 01:30 AM.
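For anyone following the same path, the restarts roughly followed the documented rolling-upgrade procedure. A minimal sketch (hypothetical host, and the setting value may differ depending on your version's instructions) of the shard-allocation toggle used around each node restart:

```python
import requests

ES = "http://localhost:9200"  # placeholder address

def disable_allocation():
    # Avoid shuffling replicas around while a node is briefly down.
    requests.put(f"{ES}/_cluster/settings", json={
        "persistent": {"cluster.routing.allocation.enable": "primaries"}
    })

def enable_allocation():
    # Reset the setting (null) once the upgraded node has rejoined the cluster.
    requests.put(f"{ES}/_cluster/settings", json={
        "persistent": {"cluster.routing.allocation.enable": None}
    })
```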

For now, it seems the heap of my coordinator nodes is doing well:

[screenshot: coordinator node heap usage after the upgrade]
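Besides the monitoring graphs, I'm keeping an eye on it with a small polling script; a sketch (placeholder host, no auth):

```python
# Poll heap usage of every node once a minute and print one line per sample.
import time
import requests

ES = "http://localhost:9200"  # placeholder: coordinator address

while True:
    nodes = requests.get(f"{ES}/_nodes/stats/jvm").json()["nodes"]
    sample = ", ".join(
        f'{n["name"]}={n["jvm"]["mem"]["heap_used_percent"]}%' for n in nodes.values()
    )
    print(time.strftime("%H:%M:%S"), sample)
    time.sleep(60)
```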

I will give some news tomorrow

Best regards

Hi,

Since the upgrade, the heap of my coordinator nodes has been much more stable.

[screenshot: coordinator node heap usage, stable since the upgrade]

So I think we can consider this issue closed.

Many thanks for your help

Best regards
Jérôme

Great.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.