Master and Coordinating Nodes crashing with "fatal error on the network layer" and Heap OOM

vijay.sangha · November 6, 2018, 8:14pm

All was well until the recent reboot of the VMs. Nothing changed except the OS patching. Somehow the network connection between the nodes keeps dropping, and the affected nodes do not stay up for more than 15 minutes.

Here is my setup

Elasticsearch version: 5.3.0
4 data nodes, 5 GB heap (no issues there)
2 master nodes, 1 GB heap (crashing with fatal network error and heap OOM)
2 coordinating nodes, 2 GB heap (crashing with fatal network error and heap OOM)

Here is the Error Stack

[2018-11-06T14:54:38,421][ERROR][o.e.t.n.Netty4Utils ] fatal error on the network layer
at org.elasticsearch.transport.netty4.Netty4Utils.maybeDie(Netty4Utils.java:140)
at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.exceptionCaught(Netty4MessageChannelHandler.java:83)
at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:286)
at io.netty.channel.AbstractChannelHandlerContext.notifyHandlerException(AbstractChannelHandlerContext.java:851)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:349)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:341)
at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:293)
at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:280)
at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:396)
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:248)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:349)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:341)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:349)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:129)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:642)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:527)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:481)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:441)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at java.lang.Thread.run(Thread.java:745)
[2018-11-06T14:54:38,409][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [NLOG3_Master] fatal error in thread [elasticsearch[NLOG3_Master][clusterService#updateTask][T#1]], exiting
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3181) ~[?:1.8.0_73]
at java.util.ArrayList.grow(ArrayList.java:261) ~[?:1.8.0_73]
at java.util.ArrayList.ensureExplicitCapacity(ArrayList.java:235) ~[?:1.8.0_73]
at java.util.ArrayList.ensureCapacityInternal(ArrayList.java:227) ~[?:1.8.0_73]
at java.util.ArrayList.add(ArrayList.java:458) ~[?:1.8.0_73]
at org.elasticsearch.cluster.routing.IndexShardRoutingTable$Builder.addShard(IndexShardRoutingTable.java:585) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.cluster.routing.IndexRoutingTable$Builder.addShard(IndexRoutingTable.java:532) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.cluster.routing.RoutingTable$Builder.updateNodes(RoutingTable.java:449) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.cluster.routing.allocation.AllocationService.buildResultAndLogHealthChange(AllocationService.java:101) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.cluster.routing.allocation.AllocationService.applyStartedShards(AllocationService.java:95) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.cluster.action.shard.ShardStateAction$ShardStartedClusterStateTaskExecutor.execute(ShardStateAction.java:438) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.cluster.service.ClusterService.executeTasks(ClusterService.java:679) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.cluster.service.ClusterService.calculateTaskOutputs(ClusterService.java:658) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.cluster.service.ClusterService.runTasks(ClusterService.java:617) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.cluster.service.ClusterService$UpdateTask.run(ClusterService.java:1117) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:544) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:238) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:201) ~[elasticsearch-5.3.0.jar:5.3.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[?:1.8.0_73]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[?:1.8.0_73]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_73]
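
When two fatal errors land in the same log within milliseconds, it can help to line them up by timestamp to see which one was logged first. The following is a minimal Python sketch for that, assuming the log uses the `[timestamp][LEVEL][logger] message` header format shown in the excerpt above. The function name `fatal_events` and the regular expression are illustrative and not part of Elasticsearch.

```python
import re

# Header format assumed from the excerpt above, e.g.
# [2018-11-06T14:54:38,421][ERROR][o.e.t.n.Netty4Utils ] fatal error on the network layer
HEADER = re.compile(
    r"^\[(?P<ts>[^\]]+)\]"          # timestamp
    r"\[(?P<level>[A-Z]+)\s*\]"     # log level, possibly space-padded
    r"\[(?P<logger>[^\]]+?)\s*\]"   # abbreviated logger name
    r"\s*(?P<msg>.*)$"              # rest of the line
)


def fatal_events(log_text):
    """Return (timestamp, logger, message, cause) tuples sorted by timestamp.

    `cause` is the first non-blank line after the header that is not a
    stack frame ("at ..."), or None if the entry has no such line.
    """
    events = []
    current = None
    for line in log_text.splitlines():
        match = HEADER.match(line)
        if match:
            current = None
            if match.group("level") == "ERROR" and "fatal error" in match.group("msg"):
                current = [match.group("ts"), match.group("logger"),
                           match.group("msg"), None]
                events.append(current)
        elif current is not None and current[3] is None:
            stripped = line.strip()
            if stripped and not stripped.startswith("at "):
                current[3] = stripped
    # ISO-style timestamps sort correctly as plain strings.
    return sorted((tuple(e) for e in events), key=lambda e: e[0])
```

Applied to the excerpt above, this lists the `OutOfMemoryError` entry (14:54:38,409) ahead of the Netty entry (14:54:38,421). That ordering suggests, but does not prove, that the network-layer error is a consequence of the heap exhaustion rather than its cause.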

warkolm · November 6, 2018, 8:18pm

What OS? Can you roll those changes back?

vijay.sangha · November 6, 2018, 8:20pm

Red Hat Enterprise Linux Server release 6.10 (Santiago)

No, as far as I know rolling them back is not an option, because it was a security patch.

I do have the exact same setup in Production, which was also patched this past weekend, and I do not see any issues there.

warkolm · November 6, 2018, 8:21pm

What kernel version is it?

vijay.sangha · November 6, 2018, 8:22pm

Here's the Kernel version

2.6.32-754.3.5.el6.x86_64

vijay.sangha · November 6, 2018, 8:24pm

The patches were for the Spectre/Meltdown vulnerabilities.

vijay.sangha · November 6, 2018, 8:25pm

The strange thing is that I see these Netty disconnects only on the master and coordinating nodes. I don't see any issues on the data nodes: they are not crashing and are able to rejoin the cluster quickly. The cluster had been up and working well for almost a year without any such issues.

vijay.sangha · November 6, 2018, 8:33pm

After reading through some of the discussion posts, I tried applying the changes below, but none of them worked around the issue:

#--------------------------- Discovery Tuning ----------------------------------------------##

discovery.zen.fd.ping_timeout: 90s
discovery.zen.join_timeout: 90s
transport.tcp.connect_timeout: 60s
transport.tcp.compress: true

##-------------------NETTY tuning for OOM ------------------------------###

-XX:MaxDirectMemorySize=1G
-Dio.netty.allocator.pageSize=8192
-Dio.netty.allocator.maxOrder=10
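
None of the settings above change how much heap the master needs for the cluster state itself. The `OutOfMemoryError` in the stack trace was thrown inside `RoutingTable$Builder.updateNodes`, which appears to rebuild the routing table one shard copy at a time, so the total shard count is worth checking against the 1 GB master heap. The sketch below is a hedged illustration that summarises shard counts from an already-parsed `GET _cluster/health` response. The function name `shard_summary` is hypothetical.

```python
def shard_summary(health):
    """Summarise shard counts from a parsed `GET _cluster/health` response.

    `health` is the JSON body as a dict. Relocating shards are already
    included in `active_shards`, so they are not added again.
    """
    total = (
        health["active_shards"]
        + health["initializing_shards"]
        + health["unassigned_shards"]
    )
    data_nodes = health["number_of_data_nodes"]
    per_data_node = total / data_nodes if data_nodes else None
    return {
        "status": health.get("status"),
        "total_shards": total,
        "shards_per_data_node": per_data_node,
    }
```

A total that has grown steadily over the year the cluster has been running would be consistent with the master running out of heap while it reallocates every shard after a full restart. That is only one possible explanation, and the response does not confirm it on its own.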

vijay.sangha · November 9, 2018, 5:22pm

Has anyone else run into this exception? Is there a workaround?

system · December 7, 2018, 5:22pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
