[ Frequent OOME on coordinator node ]

Hi,
For about five days now, both of my coordinating nodes have been crashing frequently with an OOME.

I have tried to gather some useful information to help investigate the problem, but I can't figure out what is wrong.

I'm not really sure what to check first, so I'm trying to find some help here :slight_smile:
It's strange because I couldn't find any correlating change in the configuration or in the data volume (though maybe I missed it).

Here is some information to help with the investigation:

Problem:
My 2 coordinator nodes are crashing with an OOME.

Logs:
[2018-04-17T15:14:51,535][WARN ][i.n.c.AbstractChannelHandlerContext] An exception 'java.lang.OutOfMemoryError: Java heap space' [enable DEBUG level for full stacktrace] was thrown by a user handler's exceptionCaught() method while handling the following exception:
java.lang.OutOfMemoryError: Java heap space
[2018-04-17T15:14:53,038][ERROR][o.e.x.m.c.n.NodeStatsCollector] [coordinator1] collector [node_stats] timed out when collecting data
[2018-04-17T15:14:46,205][WARN ][i.n.c.AbstractChannelHandlerContext] An exception 'java.lang.OutOfMemoryError: Java heap space' [enable DEBUG level for full stacktrace] was thrown by a user handler's exceptionCaught() method while handling the following exception:
java.lang.OutOfMemoryError: Java heap space
[2018-04-17T15:14:58,029][ERROR][o.e.t.n.Netty4Utils ] fatal error on the network layer
at org.elasticsearch.transport.netty4.Netty4Utils.maybeDie(Netty4Utils.java:185)
at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.exceptionCaught(Netty4MessageChannelHandler.java:73)
at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:285)
at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:264)
at io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:256)
at io.netty.channel.ChannelInboundHandlerAdapter.exceptionCaught(ChannelInboundHandlerAdapter.java:131)
at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:285)
at io.netty.channel.AbstractChannelHandlerContext.notifyHandlerException(AbstractChannelHandlerContext.java:850)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:364)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:86)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:134)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:544)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:498)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at java.base/java.lang.Thread.run(Thread.java:844)
[2018-04-17T15:14:48,801][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [coordinator1] fatal error in thread [Thread-7], exiting
java.lang.OutOfMemoryError: Java heap space

Metrics:

  • Indices: 2093 (daily indices)
  • Total shards: 4184
  • ES version: 6.1.1
  • 3 master nodes: 4 CPU / 8 GB RAM
  • 4 data nodes: 16 CPU / 32 GB RAM
  • 2 coordinator nodes: 2 CPU / 4 GB RAM
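
For reference, figures like these can be pulled straight from the cluster (the cat and cluster APIs shown here are standard Elasticsearch endpoints):

    # Cluster-wide index/shard totals and overall health
    curl -s 'localhost:9200/_cluster/health?pretty'
    # Shard and disk distribution per data node
    curl -s 'localhost:9200/_cat/allocation?v'
    # Heap and RAM usage per node
    curl -s 'localhost:9200/_cat/nodes?v&h=name,node.role,heap.percent,ram.percent'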

Heap dump analysis:

The heap dump analysis looks much the same on both coordinator nodes.
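
For anyone wanting to reproduce the dumps: the JVM writes them itself on OOME, since the Elasticsearch distribution ships with the heap-dump flag enabled in jvm.options (the path below is illustrative, not my actual setting):

    # jvm.options (enabled by default in the Elasticsearch distribution)
    -XX:+HeapDumpOnOutOfMemoryError
    # optional: control where the dumps land (path is illustrative)
    -XX:HeapDumpPath=/var/lib/elasticsearch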

Thank you for your help
Best regards
Jérôme

Which version is it?

4184 shards for 4 data nodes seems a bit too much; that's over 1000 shards per data node.
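
One common way to bring that down for daily indices is to lower the shard count in the index template. A sketch (the template name and index pattern here are illustrative, not taken from your cluster):

    curl -s -X PUT 'localhost:9200/_template/daily-logs' \
      -H 'Content-Type: application/json' -d '
    {
      "index_patterns": ["logs-*"],
      "settings": { "number_of_shards": 1, "number_of_replicas": 1 }
    }'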

Hi,

The ES version is: 6.1.1

Could you upgrade?

OK, I will upgrade to the latest version and keep this topic up to date.

Thanks for your help

Regards

Just to keep you up to date:

I have rolling-upgraded my ES cluster to 6.2.4.
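
For anyone following along, each node went through the usual rolling-upgrade steps, roughly like this (a sketch; the service name and the package upgrade step are illustrative):

    # 1. Stop shard allocation so nothing rebalances while the node is down
    curl -s -X PUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' \
      -d '{"transient": {"cluster.routing.allocation.enable": "none"}}'
    # 2. Optional synced flush to speed up shard recovery afterwards
    curl -s -X POST 'localhost:9200/_flush/synced'
    # 3. Stop the node, upgrade the package, restart it (service name is illustrative)
    systemctl stop elasticsearch
    systemctl start elasticsearch
    # 4. Re-enable allocation and wait for green before moving to the next node
    curl -s -X PUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' \
      -d '{"transient": {"cluster.routing.allocation.enable": null}}'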

The operation finished at ~01:30 AM.

For now, it seems the heap on my coordinator nodes is doing well:

[screenshot: coordinator node heap usage]
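
For reference, the same figure can also be read straight from the node stats API:

    curl -s 'localhost:9200/_nodes/stats/jvm?filter_path=nodes.*.name,nodes.*.jvm.mem.heap_used_percent&pretty'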

I will post an update tomorrow.

Best regards

Hi,

Since the upgrade, the heap on my coordinator nodes has been much more stable.

[screenshot: coordinator node heap usage since the upgrade]

So I think we can consider this issue closed.

Many thanks for your help

Best regards
Jérôme

Great.