Fatal error on the network layer


(Niraj Kumar) #1

Hi Everyone,

I am almost stumped now. My Elasticsearch cluster went dead and all shards were in an unassigned state. We started the services back up, and while the shards were being reallocated things were fine up to a certain point, but the node crashes after an hour or two with the following error.

[2017-08-01T15:28:43,733][WARN ][o.e.m.j.JvmGcMonitorService] [xx.xx.xx.xx] [gc][old][17873][562] duration [24.7s], collections [1]/[24.8s], total [24.7s]/[7.7m], memory [9.9gb]->[9.7gb]/[9.9gb], all_pools {[young] [532.5mb]->[379.2mb]/[532.5mb]}{[survivor] [59.2mb]->[0b]/[66.5mb]}{[old] [9.3gb]->[9.3gb]/[9.3gb]}
[2017-08-01T15:28:43,734][WARN ][o.e.m.j.JvmGcMonitorService] [xx.xx.xx.xx] [gc][17873] overhead, spent [24.7s] collecting in the last [24.8s]
[2017-08-01T15:28:44,734][INFO ][o.e.m.j.JvmGcMonitorService] [xx.xx.xx.xx] [gc][17874] overhead, spent [435ms] collecting in the last [1s]
[2017-08-01T15:29:42,373][WARN ][o.e.m.j.JvmGcMonitorService] [xx.xx.xx.xx] [gc][old][17876][564] duration [25s], collections [1]/[25.8s], total [25s]/[8.6m], memory [9.7gb]->[9.8gb]/[9.9gb], all_pools {[young] [425.9mb]->[473.7mb]/[532.5mb]}{[survivor] [0b]->[0b]/[66.5mb]}{[old] [9.3gb]->[9.3gb]/[9.3gb]}
[2017-08-01T15:29:42,373][WARN ][o.e.m.j.JvmGcMonitorService] [xx.xx.xx.xx] [gc][17876] overhead, spent [25s] collecting in the last [25.8s]
[2017-08-01T15:30:14,533][WARN ][o.e.m.j.JvmGcMonitorService] [xx.xx.xx.xx] [gc][old][17877][565] duration [31.4s], collections [1]/[32.1s], total [31.4s]/[9.1m], memory [9.8gb]->[9.8gb]/[9.9gb], all_pools {[young] [473.7mb]->[508.9mb]/[532.5mb]}{[survivor] [0b]->[0b]/[66.5mb]}{[old] [9.3gb]->[9.3gb]/[9.3gb]}
[2017-08-01T15:30:14,534][WARN ][o.e.m.j.JvmGcMonitorService] [xx.xx.xx.xx] [gc][17877] overhead, spent [31.4s] collecting in the last [32.1s]
[2017-08-01T15:30:38,120][WARN ][o.e.m.j.JvmGcMonitorService] [xx.xx.xx.xx] [gc][old][17878][566] duration [22.9s], collections [1]/[23.1s], total [22.9s]/[9.5m], memory [9.8gb]->[9.8gb]/[9.9gb], all_pools {[young] [508.9mb]->[532.5mb]/[532.5mb]}{[survivor] [0b]->[622.3kb]/[66.5mb]}{[old] [9.3gb]->[9.3gb]/[9.3gb]}
[2017-08-01T15:30:38,120][WARN ][o.e.m.j.JvmGcMonitorService] [xx.xx.xx.xx] [gc][17878] overhead, spent [22.9s] collecting in the last [23.1s]
[2017-08-01T15:31:10,660][WARN ][o.e.m.j.JvmGcMonitorService] [xx.xx.xx.xx] [gc][old][17879][567] duration [31.8s], collections [1]/[32.9s], total [31.8s]/[10m], memory [9.8gb]->[9.8gb]/[9.9gb], all_pools {[young] [532.5mb]->[532.5mb]/[532.5mb]}{[survivor] [622.3kb]->[30.3mb]/[66.5mb]}{[old] [9.3gb]->[9.3gb]/[9.3gb]}
[2017-08-01T15:31:10,660][WARN ][o.e.m.j.JvmGcMonitorService] [xx.xx.xx.xx] [gc][17879] overhead, spent [31.8s] collecting in the last [32.9s]
[2017-08-01T15:31:36,182][WARN ][o.e.m.j.JvmGcMonitorService] [xx.xx.xx.xx] [gc][old][17880][568] duration [25.1s], collections [1]/[25.5s], total [25.1s]/[10.5m], memory [9.8gb]->[9.9gb]/[9.9gb], all_pools {[young] [532.5mb]->[532.5mb]/[532.5mb]}{[survivor] [30.3mb]->[53.2mb]/[66.5mb]}{[old] [9.3gb]->[9.3gb]/[9.3gb]}
[2017-08-01T15:31:36,182][WARN ][o.e.m.j.JvmGcMonitorService] [xx.xx.xx.xx] [gc][17880] overhead, spent [25.1s] collecting in the last [25.5s]
[2017-08-01T15:37:27,404][ERROR][o.e.t.n.Netty4Utils      ] fatal error on the network layer
        at org.elasticsearch.transport.netty4.Netty4Utils.maybeDie(Netty4Utils.java:140)
        at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.exceptionCaught(Netty4MessageChannelHandler.java:83)
        at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:286)
        at io.netty.channel.AbstractChannelHandlerContext.notifyHandlerException(AbstractChannelHandlerContext.java:851)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:349)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:341)
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:293)
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:280)
        at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:396)
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:248)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:349)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:341)
        at io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:86)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:349)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:341)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:349)
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:129)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:642)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:527)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:481)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:441)
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
        at java.lang.Thread.run(Thread.java:745)

OS:- Ubuntu 14.04
Java: 1.8
Heap Space:- 10G
Node Type:- Data

I have no clue what's going wrong.
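
For context, the per-node heap and GC numbers above can be cross-checked with the nodes stats API; roughly (host and port are placeholders):

    # JVM stats per node: heap_used_in_bytes, heap_max_in_bytes, GC collection counts/times
    curl -s 'localhost:9200/_nodes/stats/jvm?pretty'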


(David Pilato) #2

It sounds like you have some memory pressure.

Can you tell us (a quick way to check both is sketched just below):

  • ES version
  • number of shards / index
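
Assuming the node listens on localhost:9200, roughly:

    # ES version
    curl -s 'localhost:9200/'
    # One line per index, including primary/replica shard counts and store size
    curl -s 'localhost:9200/_cat/indices?v'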

(Niraj Kumar) #3

Hi @dadoonet,

ES Version: 5.2.1
Number of shards :-

{
    "cluster_name": "elk-prod",
    "status": "yellow",
    "timed_out": false,
    "number_of_nodes": 9,
    "number_of_data_nodes": 3,
    "active_primary_shards": 13561,
    "active_shards": 22002,
    "relocating_shards": 0,
    "initializing_shards": 6,
    "unassigned_shards": 5114,
    "delayed_unassigned_shards": 0,
    "number_of_pending_tasks": 7,
    "number_of_in_flight_fetch": 0,
    "task_max_waiting_in_queue_millis": 3356858,
    "active_shards_percent_as_number": 81.12233611090628
}

This is about the point at which it usually fails. It is still relocating shards right now.


(Niraj Kumar) #4

Adding a screenshot from the terminal.

The screenshots are in this order:

Layer 1:- Data
Layer 2:- Master
Layer 3:- Ingest

--
Niraj


(David Pilato) #5

22002 shards on 3 data nodes?

You did not mention the size (memory) of the data nodes, but FWIW that's 7334 shards per node!

That is way too much.

Reduce that number or dramatically increase the number of nodes, IMO.


(Niraj Kumar) #6

But how do I reduce it? The point is that to run the shrink API the cluster should be in a healthy state, and the cluster is not coming back online.

Is there any other way, like a workaround, for now?

--
Niraj


(David Pilato) #7

Decrease the number of shards per index if you are using the default value (5), and/or remove old indices you are not using anymore if you are working with time-series data.
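
For example (template and index names below are only placeholders), new daily indices can default to fewer primaries via an index template, and old indices can simply be deleted:

    # Template so that new cloudtrail-* indices get 1 primary shard instead of the default 5
    curl -XPUT 'localhost:9200/_template/cloudtrail' -H 'Content-Type: application/json' -d '
    {
      "template": "cloudtrail-*",
      "settings": { "index.number_of_shards": 1, "index.number_of_replicas": 1 }
    }'

    # Drop an old daily index that is no longer needed
    curl -XDELETE 'localhost:9200/cloudtrail-2017.04.01'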


(Niraj Kumar) #8

I have already decreased the number of shards to 3 from the default (5). Well, I am trying to delete data, but even if I keep only 3 months of CloudTrail data I would still end up with a large number of shards. I know this is bad, but I am really trying to find a way to retain some data, bring the cluster back online, and shrink these indices.
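
For the shrink step, my rough understanding is that it would look something like this once the cluster is healthy again (index and node names are placeholders):

    # 1. Move all copies of the source index onto one node and block writes
    curl -XPUT 'localhost:9200/cloudtrail-2017.05/_settings' -H 'Content-Type: application/json' -d '
    {
      "index.routing.allocation.require._name": "data-node-1",
      "index.blocks.write": true
    }'

    # 2. Shrink into a new index with a single primary shard
    curl -XPOST 'localhost:9200/cloudtrail-2017.05/_shrink/cloudtrail-2017.05-shrunk' -H 'Content-Type: application/json' -d '
    {
      "settings": { "index.number_of_shards": 1, "index.number_of_replicas": 1 }
    }'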

Also, does large time-series data need to be on a hot-warm architecture in order to work efficiently?

What do you suggest for ingesting CloudTrail data from 200 AWS accounts?

--
Niraj


(David Pilato) #9

I have already decreased the number of shards to 3 from the default

Why 3? How big is each shard, in GB?

Also, does large time-series data need to be on a hot-warm architecture in order to work efficiently?

It's not absolutely needed, but it's good practice, money-wise, because it lets you run "lighter" nodes instead of only big nodes.
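
A minimal sketch of the idea, assuming a custom box_type node attribute (attribute and index names are placeholders):

    # In elasticsearch.yml on the cheaper "warm" nodes:
    #   node.attr.box_type: warm
    # (and box_type: hot on the nodes doing the indexing)
    # Then pin an older index to the warm tier:
    curl -XPUT 'localhost:9200/cloudtrail-2017.05/_settings' -H 'Content-Type: application/json' -d '
    { "index.routing.allocation.require.box_type": "warm" }'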

What do you suggest for ingesting CloudTrail data from 200 AWS accounts?

I have no experience with CloudTrail.


(Niraj Kumar) #10

Why 3? How big is each shard, in GB?

I used three shards because we were using 3 data nodes. Not sure if this was a wise decision.
Also, how can I find the size of each shard?


(David Pilato) #11

The cat shards API will tell you.
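
Something like this should do it (host is a placeholder):

    # One line per shard, with its on-disk size, biggest first
    curl -s 'localhost:9200/_cat/shards?v&h=index,shard,prirep,state,store&s=store:desc'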


(system) #12

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.