All data nodes died on the cluster

For the very first time since upgrading to version 5.6.8.2 (about three months ago), we observed a strange and fatal error in our production Elasticsearch cluster.

What happened:
All dedicated data nodes (8 in total) died one after another, most of them complaining about JVM heap space. We don't know what could have caused it, because everything had been working perfectly fine on the cluster for a very long time.

Elasticsearch Master Logs:

[2019-03-14T17:04:14,174][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [esmaster1] failed to execute on node [WUS7OWKBQNCRBwD7MhA1yA]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [analyticsengine-esdata-b7-cluster.iad1.qprod.net][10.20.15.30:9300][cluster:monitor/nodes/stats[n]] request_id [121039465] timed out after [15001ms]
        at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:961) [elasticsearch-5.6.8.jar:5.6.8]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:575) [elasticsearch-5.6.8.jar:5.6.8]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_191]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_191]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_191]
[2019-03-14T17:04:14,174][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [esmaster1] failed to execute on node [0Omy4yGZQ6iyWd72R0rkjQ]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [esdata8][10.24.96.143:9300][cluster:monitor/nodes/stats[n]] request_id [121039467] timed out after [15000ms]
        at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:961) [elasticsearch-5.6.8.jar:5.6.8]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:575) [elasticsearch-5.6.8.jar:5.6.8]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_191]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_191]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_191]
[2019-03-14T17:04:24,201][WARN ][o.e.d.z.PublishClusterStateAction] [esmaster1] timed out waiting for all nodes to process published state [752608] (timeout [30s], pending nodes: [{esdata8}{0Omy4yGZQ6iyWd72R0rkjQ}{ZbNiggbiSeOH1j9YFGhBGQ}{10.24.96.143}{10.24.96.143:9300}])

Elasticsearch Data Node 7 logs:

[2019-03-14T16:54:51,508][INFO ][o.e.m.j.JvmGcMonitorService] [esdata7] [gc][8679918] overhead, spent [384ms] collecting in the last [1s]
[2019-03-14T16:59:41,513][INFO ][o.e.m.j.JvmGcMonitorService] [esdata7] [gc][8680207] overhead, spent [685ms] collecting in the last [1.6s]
[2019-03-14T16:59:43,204][INFO ][o.e.m.j.JvmGcMonitorService] [esdata7] [gc][young][8680208][909685] duration [749ms], collections [1]/[1.6s], total [749ms]/[11.2h], memory [17.8gb]->[18gb]/[29.8gb], all_pools {[young] [44mb]->[17.5mb]/[1.4gb]}{[survivor] [191.3mb]->[191.3mb]/[191.3mb]}{[old] [17.6gb]->[17.8gb]/[28.1gb]}
[2019-03-14T16:59:43,204][INFO ][o.e.m.j.JvmGcMonitorService] [esdata7] [gc][8680208] overhead, spent [749ms] collecting in the last [1.6s]
[2019-03-14T16:59:46,654][WARN ][o.e.m.j.JvmGcMonitorService] [esdata7] [gc][young][8680211][909686] duration [1.4s], collections [1]/[1.4s], total [1.4s]/[11.2h], memory [19.5gb]->[18.5gb]/[29.8gb], all_pools {[young] [1.4gb]->[5.7mb]/[1.4gb]}{[survivor] [191.3mb]->[191.3mb]/[191.3mb]}{[old] [17.8gb]->[18.3gb]/[28.1gb]}
[2019-03-14T16:59:46,654][WARN ][o.e.m.j.JvmGcMonitorService] [esdata7] [gc][8680211] overhead, spent [1.4s] collecting in the last [1.4s]
[2019-03-14T16:59:52,655][INFO ][o.e.m.j.JvmGcMonitorService] [esdata7] [gc][8680217] overhead, spent [280ms] collecting in the last [1s]    
[2019-03-14T17:04:51,687][WARN ][o.e.m.j.JvmGcMonitorService] [esdata7] [gc][young][8680450][909737] duration [6.5s], collections [1]/[57.9s], total [6.5s]/[11.3h], memory [20.6gb]->[20gb]/[29.8gb], all_pools {[young] [1.3gb]->[3.6mb]/[1.4gb]}{[survivor] [191.3mb]->[191.3mb]/[191.3mb]}{[old] [19.1gb]->[19.8gb]/[28.1gb]}
[2019-03-14T17:06:09,016][WARN ][o.e.m.j.JvmGcMonitorService] [esdata7] [gc][8680459] overhead, spent [918ms] collecting in the last [1.3s]
[2019-03-14T17:07:17,808][WARN ][o.e.m.j.JvmGcMonitorService] [esdata7] [gc][old][8680460][965] duration [1.1m], collections [1]/[1.1m], total [1.1m]/[14.9m], memory [27.9gb]->[28.4gb]/[29.8gb], all_pools {[young] [3.5mb]->[285mb]/[1.4gb]}{[survivor] [191.3mb]->[0b]/[191.3mb]}{[old] [27.7gb]->[28.1gb]/[28.1gb]}
[2019-03-14T17:07:17,808][WARN ][o.e.m.j.JvmGcMonitorService] [esdata7] [gc][8680460] overhead, spent [1.1m] collecting in the last [1.1m]
[2019-03-14T17:08:30,897][WARN ][o.e.m.j.JvmGcMonitorService] [esdata7] [gc][old][8680463][966] duration [1.1m], collections [1]/[1.1m], total [1.1m]/[16m], memory [29.7gb]->[29.3gb]/[29.8gb], all_pools {[young] [1.4gb]->[1.2gb]/[1.4gb]}{[survivor] [156.1mb]->[0b]/[191.3mb]}{[old] [28.1gb]->[28.1gb]/[28.1gb]}
[2019-03-14T17:08:30,897][WARN ][o.e.m.j.JvmGcMonitorService] [esdata7] [gc][8680463] overhead, spent [1.1m] collecting in the last [1.1m]
[2019-03-14T17:08:30,903][WARN ][o.e.c.s.ClusterService   ] [esdata7] cluster state update task [zen-disco-receive(from master [master {esdata6}{JGboHUM9Sbaw7dZc8Bq56w}{WTfdxo5pTE-inTn0w541Nw}{10.24.16.99}{10.24.16.99:9300} committed version [752613]])] took [1.2m] above the warn threshold of 30s
[2019-03-14T17:09:27,122][WARN ][o.e.m.j.JvmGcMonitorService] [esdata7] [gc][old][8680464][967] duration [55.9s], collections [1]/[56.2s], total [55.9s]/[17m], memory [29.3gb]->[29.4gb]/[29.8gb], all_pools {[young] [1.2gb]->[1.3gb]/[1.4gb]}{[survivor] [0b]->[0b]/[191.3mb]}{[old] [28.1gb]->[28.1gb]/[28.1gb]}
[2019-03-14T17:09:27,122][WARN ][o.e.m.j.JvmGcMonitorService] [esdata7] [gc][8680464] overhead, spent [55.9s] collecting in the last [56.2s]
[2019-03-14T17:10:38,320][WARN ][o.e.m.j.JvmGcMonitorService] [esdata7] [gc][old][8680465][968] duration [1.1m], collections [1]/[1.1m], total [1.1m]/[18.2m], memory 

As you can see, the heap went up from 18G to 30G within minutes. This happened on all nodes. Is there any known bug in Elasticsearch that could lead to this situation?

See below for an additional stack trace from the data node.

Continued stack trace from data node 7:

 [2019-03-14T17:14:37,908][WARN ][o.e.m.j.JvmGcMonitorService] [esdata7] [gc][old][8680466][972] duration [3.9m], collections [4]/[3.9m], total [3.9m]/[22.1m], memory [29.3gb]->[29.6gb]/[29.8gb], all_pools {[young] [1.2gb]->[1.4gb]/[1.4gb]}{[survivor] [0b]->[47.5mb]/[191.3mb]}{[old] [28.1gb]->[28.1gb]/[28.1gb]}
    [2019-03-14T17:14:37,909][WARN ][o.e.m.j.JvmGcMonitorService] [esdata7] [gc][8680466] overhead, spent [3.9m] collecting in the last [3.9m]
    [2019-03-14T17:14:37,944][ERROR][o.e.t.n.Netty4Utils      ] fatal error on the network layer
            at org.elasticsearch.transport.netty4.Netty4Utils.maybeDie(Netty4Utils.java:184)
            at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.exceptionCaught(Netty4MessageChannelHandler.java:83)
            at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:285)
            at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:264)
            at io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:256)
            at io.netty.channel.ChannelInboundHandlerAdapter.exceptionCaught(ChannelInboundHandlerAdapter.java:131)
            at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:285)
            at io.netty.channel.AbstractChannelHandlerContext.notifyHandlerException(AbstractChannelHandlerContext.java:850)
            at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:364)
            at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
            at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
            at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241)
            at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
            at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
            at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
            at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334)
            at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
            at java.lang.Thread.run(Thread.java:748)
    [2019-03-14T17:14:37,963][WARN ][o.e.t.n.Netty4Transport  ] [esdata7] exception caught on transport layer [[id: 0xfe15ebf2, L:/10.24.16.90:50534 - R:10.24.16.76/10.24.16.76:9300]], closing connection
    io.netty.handler.codec.DecoderException: java.lang.OutOfMemoryError: Java heap space
            at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:269) ~[netty-codec-4.1.13.Final.jar:4.1.13.Final]
            at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
            at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
            at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
            at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241) [netty-handler-4.1.13.Final.jar:4.1.13.Final]
            at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
            at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
            at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
            at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
            at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
            at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
            at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
            at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:134) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
            at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
            at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:544) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
            at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:498) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
            at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
            at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) [netty-common-4.1.13.Final.jar:4.1.13.Final]
            at java.lang.Thread.run(Thread.java:748) [?:1.8.0_191]
    Caused by: java.lang.OutOfMemoryError: Java heap space
            at io.netty.util.internal.PlatformDependent.allocateUninitializedArray(PlatformDependent.java:189) ~[?:?]

More info:

  • The number of shards isn't an issue; we do have a lot of them, but this configuration has been working for us for years.
  • We didn't see any unusual activity in the cluster during this time that could have caused an OOM error on all nodes (we keep one replica).
  • The Docker container is configured with a 30 GB heap. The host configuration is c32_m128.
  • Everything started working once we rebooted the containers, but the cluster was unavailable until then. It was scary.

Let me know if you need additional information.

It is very likely that you were experiencing out-of-memory errors before, but that they were hidden from you because Netty was swallowing them. Since the change "Unwrap causes when maybe dying", we now unwrap the causes of these swallowed exceptions and, if any of them is fatal, kill Elasticsearch. We consider an out-of-memory error to be fatal and unrecoverable. When the JVM dies from an out-of-memory error, it should produce a heap dump, which you will need to investigate to understand the cause of the out-of-memory error. Very often it is caused by something like a bad query, or indexing at a rate higher than the cluster is provisioned for.
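
To sketch what that change does (illustrative Java only, not the actual Netty4Utils code): the exception handler on the network layer walks the cause chain of whatever Netty hands it and, if it finds a fatal Error such as an OutOfMemoryError, it halts the process instead of letting the error be swallowed.

    // Illustrative sketch only -- not the actual Elasticsearch/Netty code.
    public final class MaybeDieSketch {
        // Walk the cause chain of an exception raised on the network layer and,
        // if any cause is a fatal Error (e.g. OutOfMemoryError), halt the JVM
        // instead of letting the error be swallowed.
        static void maybeDie(Throwable t) {
            for (Throwable cause = t; cause != null; cause = cause.getCause()) {
                if (cause instanceof Error) {
                    cause.printStackTrace();      // make the fatal error visible in the logs
                    Runtime.getRuntime().halt(1); // fatal and unrecoverable: kill the process
                }
            }
            // non-fatal exceptions fall through to normal exception handling
        }
    }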

Thanks for looking into it.

It is very likely that you were experiencing out-of-memory errors before, but that they were hidden from you because Netty was swallowing them.

The logs seem to indicate that memory usage climbed over a period of about 15 minutes while the nodes were attempting to GC, and that the nodes died after roughly 15 minutes.

Where is the heap dump generated? What is the best way to determine the root cause? We have added safeguards to our API to prevent bad queries, and I believe Elasticsearch has built-in circuit breakers as well. What would cause all nodes in the cluster to die? One bad query would likely result in two dead nodes (we have 1 replica). It is possible that the cluster attempted to recover from the dead nodes while the client retried the query multiple times, leading to a cascading effect (multiple dead nodes), but I am skeptical that that is what actually happened.

By default it should land in the working directory for Elasticsearch. You can be explicit about it by configuring -XX:HeapDumpPath=<path> in jvm.options.
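
For example, the relevant jvm.options entries might look like the following (the path is just a placeholder, not something taken from your cluster):

    # Write a heap dump when the JVM hits an OutOfMemoryError (enable this if it
    # is not already present), and choose where the resulting .hprof file goes
    # (a directory or an explicit file path).
    -XX:+HeapDumpOnOutOfMemoryError
    -XX:HeapDumpPath=/var/lib/elasticsearch/heap-dumps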

You will have to investigate the heap dump and see what was consuming the majority of the heap at the time it was produced.

Circuit breakers help, but they are not perfect, especially in the version that you are running (we have made improvements in later versions). They are particularly prone to issues with aggregations, where it is difficult to estimate in advance how much memory an aggregation will consume.
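
If it helps with monitoring, the nodes stats API exposes per-breaker usage and trip counts (GET _nodes/stats/breaker). Here is a minimal sketch using the low-level Java REST client, assuming it is on your classpath and with a placeholder host:

    import org.apache.http.HttpHost;
    import org.apache.http.util.EntityUtils;
    import org.elasticsearch.client.Response;
    import org.elasticsearch.client.RestClient;

    public class BreakerStatsSketch {
        public static void main(String[] args) throws Exception {
            // Placeholder host/port -- point this at one of your nodes.
            RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build();
            try {
                // Per-node circuit breaker statistics: estimated size, limit, and trip
                // counts for the request, fielddata, in_flight_requests and parent breakers.
                Response response = client.performRequest("GET", "/_nodes/stats/breaker");
                System.out.println(EntityUtils.toString(response.getEntity()));
            } finally {
                client.close();
            }
        }
    }

If the breakers never trip while the old generation fills up, that is consistent with them underestimating the real memory cost, which is what tends to happen with large aggregations on this version.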

This could happen if a bad search request hits many shards that are distributed across the cluster.

Alright, I will keep an eye out for whether it happens again (so far it has happened twice). Thank you for the pointers.
