ES 5.4.0 - Collector Timed Out and Nodes Disconnected

We have started seeing this issue on our Elasticsearch 5.4.0 cluster. Below are the logs:

[2017-12-20T01:40:21,100][ERROR][o.e.x.m.c.i.IndexRecoveryCollector] [es-master-01] collector [index-recovery] timed out when collecting data
[2017-12-20T01:40:31,111][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [es-master-01] collector [cluster-stats] timed out when collecting data
[2017-12-20T01:40:45,296][WARN ][o.e.t.TransportService   ] [es-master-01] Received response for a request that has timed out, sent [43832ms] ago, timed out [28832ms] ago, action [cluster:monitor/nodes/stats[n]], node [{es-data-01}{_DgZIHKOTZmOFL_LugqVkA}{jeGORzaKQpmc0FCnRvZ9tw}{172.30.0.165}{172.30.0.165:9300}{ml.enabled=true}], id [387823346]
[2017-12-20T01:41:13,101][INFO ][o.e.c.r.a.AllocationService] [es-master-01] Cluster health status changed from [GREEN] to [YELLOW] (reason: [{es-data-01}{_DgZIHKOTZmOFL_LugqVkA}{jeGORzaKQpmc0FCnRvZ9tw}{172.30.0.165}{172.30.0.165:9300}{ml.enabled=true} failed to ping, tried [3] times, each with maximum [30s] timeout]).
[2017-12-20T01:41:13,102][INFO ][o.e.c.s.ClusterService   ] [es-master-01] removed {{es-data-01}{_DgZIHKOTZmOFL_LugqVkA}{jeGORzaKQpmc0FCnRvZ9tw}{172.30.0.165}{172.30.0.165:9300}{ml.enabled=true},}, reason: zen-disco-node-failed({es-data-01}{_DgZIHKOTZmOFL_LugqVkA}{jeGORzaKQpmc0FCnRvZ9tw}{172.30.0.165}{172.30.0.165:9300}{ml.enabled=true}), reason(failed to ping, tried [3] times, each with maximum [30s] timeout)[{es-data-01}{_DgZIHKOTZmOFL_LugqVkA}{jeGORzaKQpmc0FCnRvZ9tw}{172.30.0.165}{172.30.0.165:9300}{ml.enabled=true} failed to ping, tried [3] times, each with maximum [30s] timeout]
[2017-12-20T01:41:14,090][INFO ][o.e.c.r.DelayedAllocationService] [es-master-01] scheduling reroute for delayed shards in [58.9s] (143 delayed shards)
[2017-12-20T01:41:14,094][INFO ][o.e.c.m.MetaDataUpdateSettingsService] [es-master-01] updating number_of_replicas to [4] for indices [.security]
[2017-12-20T01:41:14,105][INFO ][o.e.c.m.MetaDataUpdateSettingsService] [es-master-01] [.security/ZykDPegDQi2timNcZY9nxA] auto expanded replicas to [4]
[2017-12-20T01:41:17,651][INFO ][o.e.c.s.ClusterService   ] [es-master-01] added {{es-data-01}{_DgZIHKOTZmOFL_LugqVkA}{jeGORzaKQpmc0FCnRvZ9tw}{172.30.0.165}{172.30.0.165:9300}{ml.enabled=true},}, reason: zen-disco-node-join[{es-data-01}{_DgZIHKOTZmOFL_LugqVkA}{jeGORzaKQpmc0FCnRvZ9tw}{172.30.0.165}{172.30.0.165:9300}{ml.enabled=true}]
[2017-12-20T01:41:18,785][INFO ][o.e.c.m.MetaDataUpdateSettingsService] [es-master-01] updating number_of_replicas to [5] for indices [.security]
[2017-12-20T01:41:18,803][INFO ][o.e.c.m.MetaDataUpdateSettingsService] [es-master-01] [.security/ZykDPegDQi2timNcZY9nxA] auto expanded replicas to [5]
[2017-12-20T01:41:23,465][WARN ][o.e.c.a.s.ShardStateAction] [es-master-01] [.monitoring-kibana-2-2017.12.20][0] received shard failed for shard id [[.monitoring-kibana-2-2017.12.20][0]], allocation id [7jpAsMGoTeyFHbs7wOcQbQ], primary term [1], message [mark copy as stale]
[2017-12-20T01:41:34,925][INFO ][o.e.c.m.MetaDataMappingService] [es-master-01] [.monitoring-kibana-2-2017.12.20/yOY906gSSUKfJ3g5JBs-aw] update_mapping [kibana_stats]
[2017-12-20T01:42:07,546][INFO ][o.e.c.r.a.AllocationService] [es-master-01] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.monitoring-kibana-2-2017.12.20][0]] ...]).
[2017-12-20T11:36:01,869][ERROR][o.e.x.m.c.i.IndicesStatsCollector] [es-master-01] collector [indices-stats] timed out when collecting data
[2017-12-20T11:36:08,178][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [es-master-01] failed to execute on node [4qAddhRXSkuHiAadNVsKtA]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [es-data-06][172.30.0.209:9300][cluster:monitor/nodes/stats[n]] request_id [389236027] timed out after [15000ms]
        at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:925) [elasticsearch-5.4.0.jar:5.4.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) [elasticsearch-5.4.0.jar:5.4.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
[2017-12-20T11:36:08,178][WARN ][o.e.a.a.c.n.s.TransportNodesStatsAction] [es-master-01] not accumulating exceptions, excluding exception from response
org.elasticsearch.action.FailedNodeException: Failed node [4qAddhRXSkuHiAadNVsKtA]
        at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.onFailure(TransportNodesAction.java:246) [elasticsearch-5.4.0.jar:5.4.0]
        at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.access$200(TransportNodesAction.java:160) [elasticsearch-5.4.0.jar:5.4.0]
        at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction$1.handleException(TransportNodesAction.java:218) [elasticsearch-5.4.0.jar:5.4.0]
        at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1041) [elasticsearch-5.4.0.jar:5.4.0]
        at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:924) [elasticsearch-5.4.0.jar:5.4.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) [elasticsearch-5.4.0.jar:5.4.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
Caused by: org.elasticsearch.transport.ReceiveTimeoutTransportException: [es-data-06][172.30.0.209:9300][cluster:monitor/nodes/stats[n]] request_id [389236027] timed out after [15000ms]
        at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:925) ~[elasticsearch-5.4.0.jar:5.4.0]
        ... 4 more
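
For context (this is not part of the original logs): the collector timeouts above likely correspond to the X-Pack monitoring collection timeout settings, and the "failed to ping, tried [3] times, each with maximum [30s] timeout" message comes from zen discovery fault detection. A sketch of the relevant elasticsearch.yml knobs, assuming the 5.x setting names; verify them against your exact version before applying, since raising timeouts only masks the underlying slowness rather than fixing it:

```yaml
# elasticsearch.yml -- a sketch, not a verified fix.
# Setting names are taken from the 5.x X-Pack monitoring and zen discovery
# documentation; confirm them for your version before use.

# Give the monitoring collectors more headroom (default timeout is 10s):
xpack.monitoring.collection.cluster.stats.timeout: 30s
xpack.monitoring.collection.index.stats.timeout: 30s
xpack.monitoring.collection.index.recovery.timeout: 30s

# Zen discovery fault detection -- these defaults produced the
# "failed to ping, tried [3] times, each with maximum [30s] timeout" log line:
discovery.zen.fd.ping_timeout: 30s   # already the value seen in the logs
discovery.zen.fd.ping_retries: 3
```

Repeated collector timeouts combined with a data node failing pings and rejoining usually point at long GC pauses or resource pressure on that node, so inspecting GC logs and `GET _nodes/hot_threads` on es-data-01 and es-data-06 is generally more informative than raising timeouts alone.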

Has anyone faced this issue before? Please suggest what needs to be done. Thank you.
