Fatal error on the network layer

Hi Everyone,

I am almost stumped now. My Elasticsearch cluster went down and all shards were left in an unassigned state. We started the services back up, and while the shards were being reallocated things were fine up to a certain point, but the node crashes after an hour or two with the following error.

[2017-08-01T15:28:43,733][WARN ][o.e.m.j.JvmGcMonitorService] [xx.xx.xx.xx] [gc][old][17873][562] duration [24.7s], collections [1]/[24.8s], total [24.7s]/[7.7m], memory [9.9gb]->[9.7gb]/[9.9gb], all_pools {[young] [532.5mb]->[379.2mb]/[532.5mb]}{[survivor] [59.2mb]->[0b]/[66.5mb]}{[old] [9.3gb]->[9.3gb]/[9.3gb]}
[2017-08-01T15:28:43,734][WARN ][o.e.m.j.JvmGcMonitorService] [xx.xx.xx.xx] [gc][17873] overhead, spent [24.7s] collecting in the last [24.8s]
[2017-08-01T15:28:44,734][INFO ][o.e.m.j.JvmGcMonitorService] [xx.xx.xx.xx] [gc][17874] overhead, spent [435ms] collecting in the last [1s]
[2017-08-01T15:29:42,373][WARN ][o.e.m.j.JvmGcMonitorService] [xx.xx.xx.xx] [gc][old][17876][564] duration [25s], collections [1]/[25.8s], total [25s]/[8.6m], memory [9.7gb]->[9.8gb]/[9.9gb], all_pools {[young] [425.9mb]->[473.7mb]/[532.5mb]}{[survivor] [0b]->[0b]/[66.5mb]}{[old] [9.3gb]->[9.3gb]/[9.3gb]}
[2017-08-01T15:29:42,373][WARN ][o.e.m.j.JvmGcMonitorService] [xx.xx.xx.xx] [gc][17876] overhead, spent [25s] collecting in the last [25.8s]
[2017-08-01T15:30:14,533][WARN ][o.e.m.j.JvmGcMonitorService] [xx.xx.xx.xx] [gc][old][17877][565] duration [31.4s], collections [1]/[32.1s], total [31.4s]/[9.1m], memory [9.8gb]->[9.8gb]/[9.9gb], all_pools {[young] [473.7mb]->[508.9mb]/[532.5mb]}{[survivor] [0b]->[0b]/[66.5mb]}{[old] [9.3gb]->[9.3gb]/[9.3gb]}
[2017-08-01T15:30:14,534][WARN ][o.e.m.j.JvmGcMonitorService] [xx.xx.xx.xx] [gc][17877] overhead, spent [31.4s] collecting in the last [32.1s]
[2017-08-01T15:30:38,120][WARN ][o.e.m.j.JvmGcMonitorService] [xx.xx.xx.xx] [gc][old][17878][566] duration [22.9s], collections [1]/[23.1s], total [22.9s]/[9.5m], memory [9.8gb]->[9.8gb]/[9.9gb], all_pools {[young] [508.9mb]->[532.5mb]/[532.5mb]}{[survivor] [0b]->[622.3kb]/[66.5mb]}{[old] [9.3gb]->[9.3gb]/[9.3gb]}
[2017-08-01T15:30:38,120][WARN ][o.e.m.j.JvmGcMonitorService] [xx.xx.xx.xx] [gc][17878] overhead, spent [22.9s] collecting in the last [23.1s]
[2017-08-01T15:31:10,660][WARN ][o.e.m.j.JvmGcMonitorService] [xx.xx.xx.xx] [gc][old][17879][567] duration [31.8s], collections [1]/[32.9s], total [31.8s]/[10m], memory [9.8gb]->[9.8gb]/[9.9gb], all_pools {[young] [532.5mb]->[532.5mb]/[532.5mb]}{[survivor] [622.3kb]->[30.3mb]/[66.5mb]}{[old] [9.3gb]->[9.3gb]/[9.3gb]}
[2017-08-01T15:31:10,660][WARN ][o.e.m.j.JvmGcMonitorService] [xx.xx.xx.xx] [gc][17879] overhead, spent [31.8s] collecting in the last [32.9s]
[2017-08-01T15:31:36,182][WARN ][o.e.m.j.JvmGcMonitorService] [xx.xx.xx.xx] [gc][old][17880][568] duration [25.1s], collections [1]/[25.5s], total [25.1s]/[10.5m], memory [9.8gb]->[9.9gb]/[9.9gb], all_pools {[young] [532.5mb]->[532.5mb]/[532.5mb]}{[survivor] [30.3mb]->[53.2mb]/[66.5mb]}{[old] [9.3gb]->[9.3gb]/[9.3gb]}
[2017-08-01T15:31:36,182][WARN ][o.e.m.j.JvmGcMonitorService] [xx.xx.xx.xx] [gc][17880] overhead, spent [25.1s] collecting in the last [25.5s]
[2017-08-01T15:37:27,404][ERROR][o.e.t.n.Netty4Utils      ] fatal error on the network layer
        at org.elasticsearch.transport.netty4.Netty4Utils.maybeDie(Netty4Utils.java:140)
        at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.exceptionCaught(Netty4MessageChannelHandler.java:83)
        at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:286)
        at io.netty.channel.AbstractChannelHandlerContext.notifyHandlerException(AbstractChannelHandlerContext.java:851)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:349)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:341)
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:293)
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:280)
        at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:396)
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:248)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:349)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:341)
        at io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:86)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:349)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:341)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:349)
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:129)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:642)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:527)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:481)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:441)
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
        at java.lang.Thread.run(Thread.java:745)

OS:- Ubuntu 14.04
Java: 1.8
Heap Space:- 10G
Node Type:- Data

I have no clue what's going wrong.

It sounds like you have some memory pressure.
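
The GC lines above show the old generation stuck at 9.3gb out of 9.3gb, so the heap is essentially full. As a quick check (host and port are assumptions, adjust for your setup), the _cat/nodes API shows heap usage per node:

    curl -XGET 'localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.current,heap.max'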

Can you tell us:

  • ES version
  • number of shards per index

Hi @dadoonet,

ES Version: 5.2.1
Number of shards :-

{
    "cluster_name": "elk-prod",
    "status": "yellow",
    "timed_out": false,
    "number_of_nodes": 9,
    "number_of_data_nodes": 3,
    "active_primary_shards": 13561,
    "active_shards": 22002,
    "relocating_shards": 0,
    "initializing_shards": 6,
    "unassigned_shards": 5114,
    "delayed_unassigned_shards": 0,
    "number_of_pending_tasks": 7,
    "number_of_in_flight_fetch": 0,
    "task_max_waiting_in_queue_millis": 3356858,
    "active_shards_percent_as_number": 81.12233611090628
}
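
For reference, this is the kind of output the cluster health API returns; it can be fetched with something like this (host is an assumption):

    curl -XGET 'localhost:9200/_cluster/health?pretty'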

This is around the point where it usually fails. It is still relocating shards now.

Adding a screenshot from the terminal.

The screenshots are in this order:

Layer 1:- Data
Layer 2:- Master
Layer 3:- Ingest

--
Niraj

22002 shards on 3 data nodes?

You did not mention the size (memory) of the data nodes, but FWIW that is 7334 shards per node!

Which is way too much.

Reduce that number or dramatically increase the number of nodes, IMO.
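
To see how those shards (and their disk usage) are spread across the data nodes, something like the _cat/allocation API helps (host is an assumption):

    curl -XGET 'localhost:9200/_cat/allocation?v'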

But how do I reduce it? The problem is that running the shrink API requires the cluster to be in a healthy state, and the cluster is not coming back online.
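
As far as I understand it, the shrink workflow would look roughly like the sketch below (index and node names are placeholders), and it first needs all copies of the source index on one node and a healthy, write-blocked index, which is exactly what I cannot get right now:

    # mark the source index read-only and pull a copy of every shard onto one node
    curl -XPUT 'localhost:9200/source_index/_settings' -d '{
      "settings": {
        "index.routing.allocation.require._name": "data-node-1",
        "index.blocks.write": true
      }
    }'

    # shrink it into a new index with fewer primary shards
    curl -XPOST 'localhost:9200/source_index/_shrink/target_index' -d '{
      "settings": { "index.number_of_shards": 1 }
    }'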

Is there any other way, like a workaround for now?

--
Niraj

Decrease the number of shards per index if you are using the default value (5), and/or remove old indices you are not using anymore if you are working with time-series data.
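
As a rough sketch (template and index names are made up for illustration), lowering the shard count for future daily indices and dropping an old one could look like:

    # template so new cloudtrail-* indices are created with a single primary shard
    curl -XPUT 'localhost:9200/_template/cloudtrail' -d '{
      "template": "cloudtrail-*",
      "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 1
      }
    }'

    # drop an old daily index that is no longer needed
    curl -XDELETE 'localhost:9200/cloudtrail-2017.04.01'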

I have already decreased the number of shards to 3 from the default (5). I am trying to delete data, but even if I keep only 3 months of CloudTrail data I would still end up with a large number of shards. I know this is bad, but I am really looking for a way to retain some data, bring the cluster back online, and then shrink these indices.

Also, does a large time-series dataset need to be on a hot-warm architecture to work efficiently?

What do you suggest for ingesting CloudTrail data from 200 AWS accounts?

--
Niraj

I have already decreased the number of shards to 3 from the default

Why 3? How big is each shard, in GB?

Also, does a large time-series dataset need to be on a hot-warm architecture to work efficiently?

It's not absolutely needed, but it's good practice money-wise, because it lets you run "lighter" nodes instead of only big nodes.
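
A minimal hot-warm sketch (the box_type attribute name and its values are just a convention, and the index name is a placeholder): tag the nodes in elasticsearch.yml, then route older indices to the warm tier with an allocation filter.

    # elasticsearch.yml on a hot node (warm nodes would set box_type: warm)
    node.attr.box_type: hot

    # move an older index onto the warm nodes
    curl -XPUT 'localhost:9200/cloudtrail-2017.05.01/_settings' -d '{
      "index.routing.allocation.require.box_type": "warm"
    }'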

What do you suggest for ingesting CloudTrail data from 200 AWS accounts?

I have no experience with CloudTrail.

Why 3? How big is each shard, in GB?

I used three shards because we were using 3 data nodes; not sure if this was a wise decision.
Also, how can I find the size of each shard?

The cat shards API will tell you.
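
Something like this (host assumed) lists every shard with its store size, which makes it easy to spot lots of tiny shards:

    curl -XGET 'localhost:9200/_cat/shards?v&h=index,shard,prirep,state,store'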
