We have started seeing this issue on our Elasticsearch 5.4.0 cluster: the X-Pack monitoring collectors (index-recovery, cluster-stats, indices-stats) time out, and a data node occasionally fails the fault-detection pings, is removed from the cluster, and then rejoins a few seconds later. Below are the logs from the elected master, es-master-01:
[2017-12-20T01:40:21,100][ERROR][o.e.x.m.c.i.IndexRecoveryCollector] [es-master-01] collector [index-recovery] timed out when collecting data
[2017-12-20T01:40:31,111][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [es-master-01] collector [cluster-stats] timed out when collecting data
[2017-12-20T01:40:45,296][WARN ][o.e.t.TransportService ] [es-master-01] Received response for a request that has timed out, sent [43832ms] ago, timed out [28832ms] ago, action [cluster:monitor/nodes/stats[n]], node [{es-data-01}{_DgZIHKOTZmOFL_LugqVkA}{jeGORzaKQpmc0FCnRvZ9tw}{172.30.0.165}{172.30.0.165:9300}{ml.enabled=true}], id [387823346]
[2017-12-20T01:41:13,101][INFO ][o.e.c.r.a.AllocationService] [es-master-01] Cluster health status changed from [GREEN] to [YELLOW] (reason: [{es-data-01}{_DgZIHKOTZmOFL_LugqVkA}{jeGORzaKQpmc0FCnRvZ9tw}{172.30.0.165}{172.30.0.165:9300}{ml.enabled=true} failed to ping, tried [3] times, each with maximum [30s] timeout]).
[2017-12-20T01:41:13,102][INFO ][o.e.c.s.ClusterService ] [es-master-01] removed {{es-data-01}{_DgZIHKOTZmOFL_LugqVkA}{jeGORzaKQpmc0FCnRvZ9tw}{172.30.0.165}{172.30.0.165:9300}{ml.enabled=true},}, reason: zen-disco-node-failed({es-data-01}{_DgZIHKOTZmOFL_LugqVkA}{jeGORzaKQpmc0FCnRvZ9tw}{172.30.0.165}{172.30.0.165:9300}{ml.enabled=true}), reason(failed to ping, tried [3] times, each with maximum [30s] timeout)[{es-data-01}{_DgZIHKOTZmOFL_LugqVkA}{jeGORzaKQpmc0FCnRvZ9tw}{172.30.0.165}{172.30.0.165:9300}{ml.enabled=true} failed to ping, tried [3] times, each with maximum [30s] timeout]
[2017-12-20T01:41:14,090][INFO ][o.e.c.r.DelayedAllocationService] [es-master-01] scheduling reroute for delayed shards in [58.9s] (143 delayed shards)
[2017-12-20T01:41:14,094][INFO ][o.e.c.m.MetaDataUpdateSettingsService] [es-master-01] updating number_of_replicas to [4] for indices [.security]
[2017-12-20T01:41:14,105][INFO ][o.e.c.m.MetaDataUpdateSettingsService] [es-master-01] [.security/ZykDPegDQi2timNcZY9nxA] auto expanded replicas to [4]
[2017-12-20T01:41:17,651][INFO ][o.e.c.s.ClusterService ] [es-master-01] added {{es-data-01}{_DgZIHKOTZmOFL_LugqVkA}{jeGORzaKQpmc0FCnRvZ9tw}{172.30.0.165}{172.30.0.165:9300}{ml.enabled=true},}, reason: zen-disco-node-join[{es-data-01}{_DgZIHKOTZmOFL_LugqVkA}{jeGORzaKQpmc0FCnRvZ9tw}{172.30.0.165}{172.30.0.165:9300}{ml.enabled=true}]
[2017-12-20T01:41:18,785][INFO ][o.e.c.m.MetaDataUpdateSettingsService] [es-master-01] updating number_of_replicas to [5] for indices [.security]
[2017-12-20T01:41:18,803][INFO ][o.e.c.m.MetaDataUpdateSettingsService] [es-master-01] [.security/ZykDPegDQi2timNcZY9nxA] auto expanded replicas to [5]
[2017-12-20T01:41:23,465][WARN ][o.e.c.a.s.ShardStateAction] [es-master-01] [.monitoring-kibana-2-2017.12.20][0] received shard failed for shard id [[.monitoring-kibana-2-2017.12.20][0]], allocation id [7jpAsMGoTeyFHbs7wOcQbQ], primary term [1], message [mark copy as stale]
[2017-12-20T01:41:34,925][INFO ][o.e.c.m.MetaDataMappingService] [es-master-01] [.monitoring-kibana-2-2017.12.20/yOY906gSSUKfJ3g5JBs-aw] update_mapping [kibana_stats]
[2017-12-20T01:42:07,546][INFO ][o.e.c.r.a.AllocationService] [es-master-01] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.monitoring-kibana-2-2017.12.20][0]] ...]).
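The node rejoins within a few seconds every time, which makes us suspect a long stop-the-world GC pause or a load spike on the data node rather than a real network partition. In case it helps, this is what I plan to capture the next time a node is removed. This is only a rough sketch using standard 5.4 APIs; the node name es-data-01 comes from the logs above, and localhost:9200 and the log path are placeholders for our setup:

# JVM heap/GC and OS stats on the node that was removed (long GC collection
# times here would point at stop-the-world pauses rather than the network):
curl -s 'localhost:9200/_nodes/es-data-01/stats/jvm,os?human&pretty'

# Quick per-node overview of heap, CPU and load across the cluster:
curl -s 'localhost:9200/_cat/nodes?v&h=name,heap.percent,ram.percent,cpu,load_1m'

# The data node's own log should also show GC warnings around 01:40 if this is
# GC-related (log path is an assumption for our install):
grep -i 'gc' /var/log/elasticsearch/*.log | grep '2017-12-20T01:4'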
[2017-12-20T11:36:01,869][ERROR][o.e.x.m.c.i.IndicesStatsCollector] [es-master-01] collector [indices-stats] timed out when collecting data
[2017-12-20T11:36:08,178][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [es-master-01] failed to execute on node [4qAddhRXSkuHiAadNVsKtA]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [es-data-06][172.30.0.209:9300][cluster:monitor/nodes/stats[n]] request_id [389236027] timed out after [15000ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:925) [elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) [elasticsearch-5.4.0.jar:5.4.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
[2017-12-20T11:36:08,178][WARN ][o.e.a.a.c.n.s.TransportNodesStatsAction] [es-master-01] not accumulating exceptions, excluding exception from response
org.elasticsearch.action.FailedNodeException: Failed node [4qAddhRXSkuHiAadNVsKtA]
at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.onFailure(TransportNodesAction.java:246) [elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.access$200(TransportNodesAction.java:160) [elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction$1.handleException(TransportNodesAction.java:218) [elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1041) [elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:924) [elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) [elasticsearch-5.4.0.jar:5.4.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
Caused by: org.elasticsearch.transport.ReceiveTimeoutTransportException: [es-data-06][172.30.0.209:9300][cluster:monitor/nodes/stats[n]] request_id [389236027] timed out after [15000ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:925) ~[elasticsearch-5.4.0.jar:5.4.0]
... 4 more
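The second episode (from 11:36) looks similar but without the node being removed: the indices-stats collector times out and a nodes-stats request to es-data-06 times out after 15000ms. When that happens again I can also grab the following (again just a sketch; the node name and endpoint are taken from the logs above):

# Hot threads on the node whose stats requests are timing out:
curl -s 'localhost:9200/_nodes/es-data-06/hot_threads?threads=5'

# Thread pool saturation / rejections across the cluster:
curl -s 'localhost:9200/_cat/thread_pool?v&h=node_name,name,active,queue,rejected'

# Pending cluster-state tasks (a large backlog here would point at a busy master):
curl -s 'localhost:9200/_cluster/pending_tasks?pretty'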
Has anyone faced this issue before? Please suggest what needs to be done. Thank you.