Fatal error on the network layer


(Niraj Kumar) #1

Hi Everyone,

I am almost stumped now. My Elasticsearch cluster went dead and all shards were in an unassigned state. We started the services back up, and while the shards were being reallocated things were fine up to a certain point, but the node crashes after an hour or two with the following error.

[2017-08-01T15:28:43,733][WARN ][o.e.m.j.JvmGcMonitorService] [xx.xx.xx.xx] [gc][old][17873][562] duration [24.7s], collections [1]/[24.8s], total [24.7s]/[7.7m], memory [9.9gb]->[9.7gb]/[9.9gb], all_pools {[young] [532.5mb]->[379.2mb]/[532.5mb]}{[survivor] [59.2mb]->[0b]/[66.5mb]}{[old] [9.3gb]->[9.3gb]/[9.3gb]}
[2017-08-01T15:28:43,734][WARN ][o.e.m.j.JvmGcMonitorService] [xx.xx.xx.xx] [gc][17873] overhead, spent [24.7s] collecting in the last [24.8s]
[2017-08-01T15:28:44,734][INFO ][o.e.m.j.JvmGcMonitorService] [xx.xx.xx.xx] [gc][17874] overhead, spent [435ms] collecting in the last [1s]
[2017-08-01T15:29:42,373][WARN ][o.e.m.j.JvmGcMonitorService] [xx.xx.xx.xx] [gc][old][17876][564] duration [25s], collections [1]/[25.8s], total [25s]/[8.6m], memory [9.7gb]->[9.8gb]/[9.9gb], all_pools {[young] [425.9mb]->[473.7mb]/[532.5mb]}{[survivor] [0b]->[0b]/[66.5mb]}{[old] [9.3gb]->[9.3gb]/[9.3gb]}
[2017-08-01T15:29:42,373][WARN ][o.e.m.j.JvmGcMonitorService] [xx.xx.xx.xx] [gc][17876] overhead, spent [25s] collecting in the last [25.8s]
[2017-08-01T15:30:14,533][WARN ][o.e.m.j.JvmGcMonitorService] [xx.xx.xx.xx] [gc][old][17877][565] duration [31.4s], collections [1]/[32.1s], total [31.4s]/[9.1m], memory [9.8gb]->[9.8gb]/[9.9gb], all_pools {[young] [473.7mb]->[508.9mb]/[532.5mb]}{[survivor] [0b]->[0b]/[66.5mb]}{[old] [9.3gb]->[9.3gb]/[9.3gb]}
[2017-08-01T15:30:14,534][WARN ][o.e.m.j.JvmGcMonitorService] [xx.xx.xx.xx] [gc][17877] overhead, spent [31.4s] collecting in the last [32.1s]
[2017-08-01T15:30:38,120][WARN ][o.e.m.j.JvmGcMonitorService] [xx.xx.xx.xx] [gc][old][17878][566] duration [22.9s], collections [1]/[23.1s], total [22.9s]/[9.5m], memory [9.8gb]->[9.8gb]/[9.9gb], all_pools {[young] [508.9mb]->[532.5mb]/[532.5mb]}{[survivor] [0b]->[622.3kb]/[66.5mb]}{[old] [9.3gb]->[9.3gb]/[9.3gb]}
[2017-08-01T15:30:38,120][WARN ][o.e.m.j.JvmGcMonitorService] [xx.xx.xx.xx] [gc][17878] overhead, spent [22.9s] collecting in the last [23.1s]
[2017-08-01T15:31:10,660][WARN ][o.e.m.j.JvmGcMonitorService] [xx.xx.xx.xx] [gc][old][17879][567] duration [31.8s], collections [1]/[32.9s], total [31.8s]/[10m], memory [9.8gb]->[9.8gb]/[9.9gb], all_pools {[young] [532.5mb]->[532.5mb]/[532.5mb]}{[survivor] [622.3kb]->[30.3mb]/[66.5mb]}{[old] [9.3gb]->[9.3gb]/[9.3gb]}
[2017-08-01T15:31:10,660][WARN ][o.e.m.j.JvmGcMonitorService] [xx.xx.xx.xx] [gc][17879] overhead, spent [31.8s] collecting in the last [32.9s]
[2017-08-01T15:31:36,182][WARN ][o.e.m.j.JvmGcMonitorService] [xx.xx.xx.xx] [gc][old][17880][568] duration [25.1s], collections [1]/[25.5s], total [25.1s]/[10.5m], memory [9.8gb]->[9.9gb]/[9.9gb], all_pools {[young] [532.5mb]->[532.5mb]/[532.5mb]}{[survivor] [30.3mb]->[53.2mb]/[66.5mb]}{[old] [9.3gb]->[9.3gb]/[9.3gb]}
[2017-08-01T15:31:36,182][WARN ][o.e.m.j.JvmGcMonitorService] [xx.xx.xx.xx] [gc][17880] overhead, spent [25.1s] collecting in the last [25.5s]
[2017-08-01T15:37:27,404][ERROR][o.e.t.n.Netty4Utils      ] fatal error on the network layer
        at org.elasticsearch.transport.netty4.Netty4Utils.maybeDie(Netty4Utils.java:140)
        at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.exceptionCaught(Netty4MessageChannelHandler.java:83)
        at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:286)
        at io.netty.channel.AbstractChannelHandlerContext.notifyHandlerException(AbstractChannelHandlerContext.java:851)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:349)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:341)
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:293)
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:280)
        at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:396)
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:248)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:349)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:341)
        at io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:86)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:349)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:341)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:349)
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:129)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:642)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:527)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:481)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:441)
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
        at java.lang.Thread.run(Thread.java:745)

OS:- Ubuntu 14.04
Java: 1.8
Heap Space:- 10G
Node Type:- Data

I have no clue what's going wrong.
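
For context, the per-node heap and GC numbers above can be cross-checked with the nodes stats API; roughly (host and port are placeholders):

    # JVM stats per node: heap_used_in_bytes, heap_max_in_bytes, GC collection counts/times
    curl -s 'localhost:9200/_nodes/stats/jvm?pretty'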


(David Pilato) #2

It sounds like you have some memory pressure.

Can you tell us (a quick way to check both is sketched just below):

  • ES version
  • number of shards / index
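
Assuming the node listens on localhost:9200, roughly:

    # ES version
    curl -s 'localhost:9200/'
    # One line per index, including primary/replica shard counts and store size
    curl -s 'localhost:9200/_cat/indices?v'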

(Niraj Kumar) #3

Hi @dadoonet,

ES Version: 5.2.1
Number of shards :-

{
    "cluster_name": "elk-prod",
    "status": "yellow",
    "timed_out": false,
    "number_of_nodes": 9,
    "number_of_data_nodes": 3,
    "active_primary_shards": 13561,
    "active_shards": 22002,
    "relocating_shards": 0,
    "initializing_shards": 6,
    "unassigned_shards": 5114,
    "delayed_unassigned_shards": 0,
    "number_of_pending_tasks": 7,
    "number_of_in_flight_fetch": 0,
    "task_max_waiting_in_queue_millis": 3356858,
    "active_shards_percent_as_number": 81.12233611090628
}

This is about the point at which it usually fails. It is still relocating shards right now.


(Niraj Kumar) #4

Adding a screenshot from the terminal.

The screenshots are in this order:

Layer 1:- Data
Layer 2:- Master
Layer 3:- Ingest

--
Niraj


(David Pilato) #5

22002 shards on 3 data nodes?

You did not mention the size (memory) of the data nodes, but FWIW that's 7334 shards per node!

That is way too much.

Reduce that number or dramatically increase the number of nodes, IMO.


(Niraj Kumar) #6

But how do I reduce it? The point is that to run the shrink API the cluster should be in a healthy state, and the cluster is not coming back online.

Is there any other way, like a workaround, for now?

--
Niraj


(David Pilato) #7

Decrease the number of shards per index if you are using the default value (5), and/or remove old indices you are not using anymore if you are working with time-series data.
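
For example (template and index names below are only placeholders), new daily indices can default to fewer primaries via an index template, and old indices can simply be deleted:

    # Template so that new cloudtrail-* indices get 1 primary shard instead of the default 5
    curl -XPUT 'localhost:9200/_template/cloudtrail' -H 'Content-Type: application/json' -d '
    {
      "template": "cloudtrail-*",
      "settings": { "index.number_of_shards": 1, "index.number_of_replicas": 1 }
    }'

    # Drop an old daily index that is no longer needed
    curl -XDELETE 'localhost:9200/cloudtrail-2017.04.01'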


(Niraj Kumar) #8

I have already decreased the number of shards to 3 from the default (5). Well, I am trying to delete data, but even if I keep only 3 months of CloudTrail data I would still end up with a large number of shards. I know this is bad, but I am really trying to find a way to retain some data, bring the cluster back online, and shrink these indices.
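
For the shrink step, my rough understanding is that it would look something like this once the cluster is healthy again (index and node names are placeholders):

    # 1. Move all copies of the source index onto one node and block writes
    curl -XPUT 'localhost:9200/cloudtrail-2017.05/_settings' -H 'Content-Type: application/json' -d '
    {
      "index.routing.allocation.require._name": "data-node-1",
      "index.blocks.write": true
    }'

    # 2. Shrink into a new index with a single primary shard
    curl -XPOST 'localhost:9200/cloudtrail-2017.05/_shrink/cloudtrail-2017.05-shrunk' -H 'Content-Type: application/json' -d '
    {
      "settings": { "index.number_of_shards": 1, "index.number_of_replicas": 1 }
    }'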

Also, does large time-series data need to be on a hot-warm architecture in order to work efficiently?

What do you suggest for ingesting CloudTrail data from 200 AWS accounts?

--
Niraj


(David Pilato) #9

I have already decreased the number of shards to 3 from the default

Why 3? How big is each shard, in GB?

Also, does large time-series data need to be on a hot-warm architecture in order to work efficiently?

It's not absolutely needed, but it's good practice, money-wise, because it lets you run "lighter" nodes instead of only big nodes.
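
A minimal sketch of the idea, assuming a custom box_type node attribute (attribute and index names are placeholders):

    # In elasticsearch.yml on the cheaper "warm" nodes:
    #   node.attr.box_type: warm
    # (and box_type: hot on the nodes doing the indexing)
    # Then pin an older index to the warm tier:
    curl -XPUT 'localhost:9200/cloudtrail-2017.05/_settings' -H 'Content-Type: application/json' -d '
    { "index.routing.allocation.require.box_type": "warm" }'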

What do you suggest for ingesting CloudTrail data from 200 AWS accounts?

I have no experience with CloudTrail.


(Niraj Kumar) #10

Why 3? How big is each shard, in GB?

I used three shards because we were using 3 data nodes. Not sure if this was a wise decision.
Also, how can I find the size of each shard?


(David Pilato) #11

The cat shards API will tell you.
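
Something like this should do it (host is a placeholder):

    # One line per shard, with its on-disk size, biggest first
    curl -s 'localhost:9200/_cat/shards?v&h=index,shard,prirep,state,store&s=store:desc'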


(system) #12

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.