ES 5.2.2: every so often a data node goes offline and then recovers?

Hi.
I have an ES topology that consists of the following:

2 x Master node
2 x Data node
2 x Ingest node

Version 5.2.2 on Ubuntu 16.04, recently upgraded from 2.3.4 over the weekend; we had no problems before the upgrade with the same amount of load.

Every few hours one of my data nodes goes offline and then recovers again. Everything else looks fine, and pings between the servers are also OK.

What kind of logging/debugging do you recommend I enable on my nodes to see why it's happening?

Viewing the default /var/log/elasticsearch/elasticsearch.log shows the following:

Master node 1:
[2017-03-13T18:58:37,879][WARN ][o.e.d.z.PublishClusterStateAction] [els01] timed out waiting for all nodes to process published state [3643] (timeout [30s], pending nodes: [{els04}{qKAxnTFkQv-EMXujFgLEWg}{geExl20MRA6st2MUqNKJEA}{192.168.10.24}{192.168.10.24:9300}{rack=side-b}])
[2017-03-13T18:58:37,897][WARN ][o.e.c.s.ClusterService ] [els01] cluster state update task [shard-started[shard id [[heartbeat-2017.03.13][0]], allocation id [UHKyd2DtRNmdpq4n1Yrkfg], primary term [0], message [after peer recovery]]] took [30s] above the warn threshold of 30s
[2017-03-13T19:43:44,359][WARN ][o.e.m.j.JvmGcMonitorService] [els01] [gc][young][151784][4813] duration [1.4s], collections [1]/[2.1s], total [1.4s]/[43.8s], memory [1.3gb]->[1.1gb]/[1.9gb], all_pools {[young] [265.9mb]->[3.3mb]/[266.2mb]}{[survivor] [10.4mb]->[7.8mb]/[33.2mb]}{[old] [1.1gb]->[1.1gb]/[1.6gb]}
[2017-03-13T19:43:44,359][WARN ][o.e.m.j.JvmGcMonitorService] [els01] [gc][151784] overhead, spent [1.4s] collecting in the last [2.1s]
[2017-03-13T19:59:47,654][WARN ][o.e.d.z.PublishClusterStateAction] [els01] timed out waiting for all nodes to process published state [3724] (timeout [30s], pending nodes: [{els04}{qKAxnTFkQv-EMXujFgLEWg}{geExl20MRA6st2MUqNKJEA}{192.168.10.24}{192.168.10.24:9300}{rack=side-b}])
[2017-03-13T19:59:48,006][WARN ][o.e.c.s.ClusterService ] [els01] cluster state update task [shard-started[shard id [[.marvel-es-1-2017.03.05][0]], allocation id [nnPyZrTITy-mmN2UNmhd3w], primary term [0], message [after peer recovery]]] took [30.3s] above the warn threshold of 30s
[2017-03-13T20:29:23,783][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [els01] failed to execute on node [HUAJtLt4Q2WRekWenXFrtQ]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [els03][192.168.10.23:9300][cluster:monitor/nodes/stats[n]] request_id [1034488] timed out after [15001ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:908) [elasticsearch-5.2.2.jar:5.2.2]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:527) [elasticsearch-5.2.2.jar:5.2.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_121]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_121]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]
[2017-03-13T20:29:23,783][WARN ][o.e.a.a.c.n.s.TransportNodesStatsAction] [els01] not accumulating exceptions, excluding exception from response
org.elasticsearch.action.FailedNodeException: Failed node [HUAJtLt4Q2WRekWenXFrtQ]
at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.onFailure(TransportNodesAction.java:247) [elasticsearch-5.2.2.jar:5.2.2]
at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.access$300(TransportNodesAction.java:160) [elasticsearch-5.2.2.jar:5.2.2]
at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction$1.handleException(TransportNodesAction.java:219) [elasticsearch-5.2.2.jar:5.2.2]
at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1024) [elasticsearch-5.2.2.jar:5.2.2]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:907) [elasticsearch-5.2.2.jar:5.2.2]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:527) [elasticsearch-5.2.2.jar:5.2.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_121]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_121]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]
Caused by: org.elasticsearch.transport.ReceiveTimeoutTransportException: [els03][192.168.10.23:9300][cluster:monitor/nodes/stats[n]] request_id [1034488] timed out after [15001ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:908) ~[elasticsearch-5.2.2.jar:5.2.2]
... 4 more
[2017-03-13T20:29:31,510][WARN ][o.e.t.TransportService ] [els01] Received response for a request that has timed out, sent [43887ms] ago, timed out [13887ms] ago, action [internal:discovery/zen/fd/ping], node [{els03}{HUAJtLt4Q2WRekWenXFrtQ}{oJyZQ0VWTXy_VahANT1QHw}{192.168.10.23}{192.168.10.23:9300}{rack=side-a}], id [1034403]
[2017-03-13T20:29:31,534][WARN ][o.e.t.TransportService ] [els01] Received response for a request that has timed out, sent [22750ms] ago, timed out [7749ms] ago, action [cluster:monitor/nodes/stats[n]], node [{els03}{HUAJtLt4Q2WRekWenXFrtQ}{oJyZQ0VWTXy_VahANT1QHw}{192.168.10.23}{192.168.10.23:9300}{rack=side-a}], id [1034488]

Check the logs for the master and els04; there may be something there that will help.
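If nothing obvious shows up there, you can also raise the discovery logging temporarily via the cluster settings API while you wait for the next drop. A rough sketch (the DEBUG level is just a suggestion; set the value back to null once you've caught the event):

PUT _cluster/settings
{
  "transient": {
    "logger.org.elasticsearch.discovery": "DEBUG"
  }
}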

That's bad; see https://www.elastic.co/guide/en/elasticsearch/guide/2.x/important-configuration-changes.html#_minimum_master_nodes
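The short version: with only two master-eligible nodes there is no safe value for minimum_master_nodes, since quorum = (2 / 2) + 1 = 2 means losing either master halts the cluster, while a value of 1 risks split brain. With three master-eligible nodes, a sketch of the line for each node's elasticsearch.yml:

discovery.zen.minimum_master_nodes: 2

(quorum = (3 / 2) + 1 = 2, so the cluster keeps working if one master drops out).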

Thanks for that.
I found out that the upgrade removed the ulimit and mlock settings (and the related system settings), so possibly when GC ran it would knock nodes offline?

I have yet to install X-Pack, which I will do next so I can see what's going on!

Anyway, I went through this page, and the cluster has now been stable for 12 hours.

https://www.elastic.co/guide/en/elasticsearch/reference/current/setting-system-settings.html#setting-system-settings
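The bits I re-applied were roughly these (a sketch for a systemd-based install, per that page; your limits may differ):

# /etc/elasticsearch/elasticsearch.yml
bootstrap.memory_lock: true

# systemd override, e.g. via 'systemctl edit elasticsearch'
[Service]
LimitMEMLOCK=infinity
LimitNOFILE=65536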

For a small cluster, should the setup be like this?

3 x Master node
2 x Data node
2 x Ingest node?

Or should I go for 3 nodes of each type?

Do you have a publicly accessible best practices/design guide that I could follow?

Thanks.


The Definitive Guide has some good stuff; my comment was 100% directed at the masters :slight_smile:

All good now.

Fixed up my Java settings :slight_smile:

What settings?

/etc/elasticsearch/jvm.options

I have 16 GB of RAM on my data nodes, so I set the heap to half of that:

-Xms8g
-Xmx8g
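You can double-check what the nodes actually picked up with something like (filter_path just trims the response):

curl -s 'localhost:9200/_nodes/jvm?filter_path=nodes.*.jvm.mem'

Keeping -Xms and -Xmx equal also avoids heap-resizing pauses at runtime.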
