Hi.
I have an ES topology that consists of the following:
2 x Master node
2 x Data node
2 x Ingest node
Version 5.2.2 on Ubuntu 16.04, recently upgraded from 2.3.4 over the weekend. The cluster had no problems before the upgrade under the same amount of load.
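For reference, the node roles are split in elasticsearch.yml along these lines. This is only a sketch of how I have it set up (node.master / node.data / node.ingest and node.attr.rack are the standard 5.x settings), not a copy of the actual files:

# master-eligible nodes (e.g. els01)
node.master: true
node.data: false
node.ingest: false

# data nodes (els03 / els04 in the logs below)
node.master: false
node.data: true
node.ingest: false

# ingest nodes
node.master: false
node.data: false
node.ingest: true

# the data nodes also carry a rack attribute (side-a / side-b), e.g.
node.attr.rack: side-a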
Every few hours one of my data nodes drops out of the cluster and then recovers on its own. Otherwise everything looks fine, and pings between the servers are also OK.
What kind of logging/debugging do you recommend I enable on my nodes to see why it's happening?
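I'm assuming I can turn up individual loggers dynamically through the cluster settings API rather than editing log4j2.properties and restarting, something along these lines (the loggers here are just my guess based on the warnings below), but I'm not sure which loggers/levels are actually the useful ones:

curl -XPUT 'http://localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{
  "transient": {
    "logger.org.elasticsearch.discovery": "DEBUG",
    "logger.org.elasticsearch.cluster.service": "DEBUG"
  }
}'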
With the default log settings, /var/log/elasticsearch/elasticsearch.log shows the following:
Master node 1:
[2017-03-13T18:58:37,879][WARN ][o.e.d.z.PublishClusterStateAction] [els01] timed out waiting for all nodes to process published state [3643] (timeout [30s], pending nodes: [{els04}{qKAxnTFkQv-EMXujFgLEWg}{geExl20MRA6st2MUqNKJEA}{192.168.10.24}{192.168.10.24:9300}{rack=side-b}])
[2017-03-13T18:58:37,897][WARN ][o.e.c.s.ClusterService ] [els01] cluster state update task [shard-started[shard id [[heartbeat-2017.03.13][0]], allocation id [UHKyd2DtRNmdpq4n1Yrkfg], primary term [0], message [after peer recovery]]] took [30s] above the warn threshold of 30s
[2017-03-13T19:43:44,359][WARN ][o.e.m.j.JvmGcMonitorService] [els01] [gc][young][151784][4813] duration [1.4s], collections [1]/[2.1s], total [1.4s]/[43.8s], memory [1.3gb]->[1.1gb]/[1.9gb], all_pools {[young] [265.9mb]->[3.3mb]/[266.2mb]}{[survivor] [10.4mb]->[7.8mb]/[33.2mb]}{[old] [1.1gb]->[1.1gb]/[1.6gb]}
[2017-03-13T19:43:44,359][WARN ][o.e.m.j.JvmGcMonitorService] [els01] [gc][151784] overhead, spent [1.4s] collecting in the last [2.1s]
[2017-03-13T19:59:47,654][WARN ][o.e.d.z.PublishClusterStateAction] [els01] timed out waiting for all nodes to process published state [3724] (timeout [30s], pending nodes: [{els04}{qKAxnTFkQv-EMXujFgLEWg}{geExl20MRA6st2MUqNKJEA}{192.168.10.24}{192.168.10.24:9300}{rack=side-b}])
[2017-03-13T19:59:48,006][WARN ][o.e.c.s.ClusterService ] [els01] cluster state update task [shard-started[shard id [[.marvel-es-1-2017.03.05][0]], allocation id [nnPyZrTITy-mmN2UNmhd3w], primary term [0], message [after peer recovery]]] took [30.3s] above the warn threshold of 30s
[2017-03-13T20:29:23,783][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [els01] failed to execute on node [HUAJtLt4Q2WRekWenXFrtQ]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [els03][192.168.10.23:9300][cluster:monitor/nodes/stats[n]] request_id [1034488] timed out after [15001ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:908) [elasticsearch-5.2.2.jar:5.2.2]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:527) [elasticsearch-5.2.2.jar:5.2.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_121]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_121]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]
[2017-03-13T20:29:23,783][WARN ][o.e.a.a.c.n.s.TransportNodesStatsAction] [els01] not accumulating exceptions, excluding exception from response
org.elasticsearch.action.FailedNodeException: Failed node [HUAJtLt4Q2WRekWenXFrtQ]
at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.onFailure(TransportNodesAction.java:247) [elasticsearch-5.2.2.jar:5.2.2]
at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.access$300(TransportNodesAction.java:160) [elasticsearch-5.2.2.jar:5.2.2]
at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction$1.handleException(TransportNodesAction.java:219) [elasticsearch-5.2.2.jar:5.2.2]
at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1024) [elasticsearch-5.2.2.jar:5.2.2]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:907) [elasticsearch-5.2.2.jar:5.2.2]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:527) [elasticsearch-5.2.2.jar:5.2.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_121]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_121]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]
Caused by: org.elasticsearch.transport.ReceiveTimeoutTransportException: [els03][192.168.10.23:9300][cluster:monitor/nodes/stats[n]] request_id [1034488] timed out after [15001ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:908) ~[elasticsearch-5.2.2.jar:5.2.2]
... 4 more
[2017-03-13T20:29:31,510][WARN ][o.e.t.TransportService ] [els01] Received response for a request that has timed out, sent [43887ms] ago, timed out [13887ms] ago, action [internal:discovery/zen/fd/ping], node [{els03}{HUAJtLt4Q2WRekWenXFrtQ}{oJyZQ0VWTXy_VahANT1QHw}{192.168.10.23}{192.168.10.23:9300}{rack=side-a}], id [1034403]
[2017-03-13T20:29:31,534][WARN ][o.e.t.TransportService ] [els01] Received response for a request that has timed out, sent [22750ms] ago, timed out [7749ms] ago, action [cluster:monitor/nodes/stats[n]], node [{els03}{HUAJtLt4Q2WRekWenXFrtQ}{oJyZQ0VWTXy_VahANT1QHw}{192.168.10.23}{192.168.10.23:9300}{rack=side-a}], id [1034488]
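For completeness, these are the standard endpoints I can hit by hand the next time a node drops, if capturing any of their output would help:

curl -s 'http://localhost:9200/_cat/nodes?v'
curl -s 'http://localhost:9200/_cluster/pending_tasks?pretty'
curl -s 'http://localhost:9200/_nodes/hot_threads'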