Hi all,
I have a three-node Elasticsearch cluster running version 6.5.4. Sometimes the data shown in Visualize and Dashboard falls behind, and it takes hours to catch up. According to the Elasticsearch monitoring dashboard, index latency increases dramatically during these incidents, and "Time spent performing refresh" also rises significantly.
The nodes are named SRV1, SRV2, and SRV3, and their logs around the start of the incident are shown below. Note that SRV2 is the master.
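As a side note on where the latency/refresh figures come from: the Monitoring UI charts are built from the nodes stats, and the same counters can be read directly from a 6.x `_nodes/stats` response. A minimal sketch (the helper and the trimmed sample payload below are mine, reduced to just the relevant fields):

```python
def refresh_time_millis(nodes_stats):
    """Total time spent on refresh per node, from a _nodes/stats response."""
    return {
        node["name"]: node["indices"]["refresh"]["total_time_in_millis"]
        for node in nodes_stats["nodes"].values()
    }

# Trimmed sample of the response shape (real payloads carry many more fields;
# the numbers here are illustrative, not from my cluster).
sample = {
    "nodes": {
        "6U240ajMSruuV3OR15F9hQ": {
            "name": "node-SRV2",
            "indices": {
                "refresh": {"total": 104208, "total_time_in_millis": 9416742}
            },
        }
    }
}
```

Watching that counter grow between two polls gives the per-node refresh time the Monitoring chart plots.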
SRV1:
[2020-11-25T11:32:34,309][ERROR][o.e.x.m.c.n.NodeStatsCollector] [node-SRV1] collector [node_stats] timed out when collecting data
SRV2:
[2020-11-25T11:32:25,050][ERROR][o.e.x.m.c.n.NodeStatsCollector] [node-SRV2] collector [node_stats] timed out when collecting data
[2020-11-25T11:32:35,066][ERROR][o.e.x.m.c.i.IndexStatsCollector] [node-SRV2] collector [index-stats] timed out when collecting data
[2020-11-25T11:34:50,771][ERROR][o.e.x.m.c.n.NodeStatsCollector] [node-SRV2] collector [node_stats] timed out when collecting data
[2020-11-25T11:36:11,288][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [node-SRV2] failed to execute on node [6U240ajMSruuV3OR15F9hQ]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [node-SRV2][10.0.54.11:9300][cluster:monitor/nodes/stats[n]] request_id [160653209] timed out after [14829ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1038) [elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:624) [elasticsearch-6.5.4.jar:6.5.4]
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:1.8.0_152]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:1.8.0_152]
at java.lang.Thread.run(Unknown Source) [?:1.8.0_152]
[2020-11-25T11:36:12,053][WARN ][o.e.t.TransportService ] [node-SRV2] Received response for a request that has timed out, sent [15641ms] ago, timed out [812ms] ago, action [cluster:monitor/nodes/stats[n]], node [{node-SRV2}{6U240ajMSruuV3OR15F9hQ}{qR87yY7RTsS8TBl6dBao1A}{10.0.54.11}{10.0.54.11:9300}{ml.machine_memory=17179262976, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}], id [160653209]
[2020-11-25T11:40:11,262][ERROR][o.e.x.m.c.n.NodeStatsCollector] [node-SRV2] collector [node_stats] timed out when collecting data
[2020-11-25T11:43:13,625][ERROR][o.e.x.m.c.i.IndexStatsCollector] [node-SRV2] collector [index-stats] timed out when collecting data
[2020-11-25T11:44:15,909][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [node-SRV2] collector [cluster_stats] timed out when collecting data
[2020-11-25T11:44:31,519][ERROR][o.e.x.m.c.n.NodeStatsCollector] [node-SRV2] collector [node_stats] timed out when collecting data
[2020-11-25T11:45:11,535][ERROR][o.e.x.m.c.i.IndexRecoveryCollector] [node-SRV2] collector [index_recovery] timed out when collecting data
[2020-11-25T11:45:21,551][ERROR][o.e.x.m.c.n.NodeStatsCollector] [node-SRV2] collector [node_stats] timed out when collecting data
[2020-11-25T11:49:01,820][ERROR][o.e.x.m.c.i.IndexRecoveryCollector] [node-SRV2] collector [index_recovery] timed out when collecting data
[2020-11-25T11:49:11,836][ERROR][o.e.x.m.c.n.NodeStatsCollector] [node-SRV2] collector [node_stats] timed out when collecting data
[2020-11-25T11:49:21,851][ERROR][o.e.x.m.c.i.IndexStatsCollector] [node-SRV2] collector [index-stats] timed out when collecting data
[2020-11-25T11:49:52,664][ERROR][o.e.x.m.c.n.NodeStatsCollector] [node-SRV2] collector [node_stats] timed out when collecting data
[2020-11-25T11:50:32,977][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [node-SRV2] collector [cluster_stats] timed out when collecting data
[2020-11-25T11:51:12,009][ERROR][o.e.x.m.c.n.NodeStatsCollector] [node-SRV2] collector [node_stats] timed out when collecting data
[2020-11-25T11:52:12,323][ERROR][o.e.x.m.c.n.NodeStatsCollector] [node-SRV2] collector [node_stats] timed out when collecting data
[2020-11-25T12:40:56,224][ERROR][o.e.x.m.c.n.NodeStatsCollector] [node-SRV2] collector [node_stats] timed out when collecting data
[2020-11-25T12:42:00,662][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [node-SRV2] collector [cluster_stats] timed out when collecting data
[2020-11-25T12:45:30,978][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [node-SRV2] collector [cluster_stats] timed out when collecting data
[2020-11-25T12:46:06,511][ERROR][o.e.x.m.c.n.NodeStatsCollector] [node-SRV2] collector [node_stats] timed out when collecting data
[2020-11-25T12:49:18,235][ERROR][o.e.x.m.c.n.NodeStatsCollector] [node-SRV2] collector [node_stats] timed out when collecting data
[2020-11-25T12:56:47,387][ERROR][o.e.x.m.c.i.IndexRecoveryCollector] [node-SRV2] collector [index_recovery] timed out when collecting data
[2020-11-25T12:57:17,418][ERROR][o.e.x.m.c.i.IndexRecoveryCollector] [node-SRV2] collector [index_recovery] timed out when collecting data
[2020-11-25T12:58:43,436][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [node-SRV2] collector [cluster_stats] timed out when collecting data
[2020-11-25T12:59:57,594][ERROR][o.e.x.m.c.i.IndexRecoveryCollector] [node-SRV2] collector [index_recovery] timed out when collecting data
[2020-11-25T13:00:07,610][ERROR][o.e.x.m.c.n.NodeStatsCollector] [node-SRV2] collector [node_stats] timed out when collecting data
[2020-11-25T13:00:17,626][ERROR][o.e.x.m.c.i.IndexStatsCollector] [node-SRV2] collector [index-stats] timed out when collecting data
[2020-11-25T13:00:27,641][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [node-SRV2] collector [cluster_stats] timed out when collecting data
[2020-11-25T13:00:57,626][ERROR][o.e.x.m.c.i.IndexRecoveryCollector] [node-SRV2] collector [index_recovery] timed out when collecting data
[2020-11-25T13:01:07,642][ERROR][o.e.x.m.c.n.NodeStatsCollector] [node-SRV2] collector [node_stats] timed out when collecting data
[2020-11-25T13:01:17,658][ERROR][o.e.x.m.c.i.IndexStatsCollector] [node-SRV2] collector [index-stats] timed out when collecting data
[2020-11-25T13:01:23,330][INFO ][o.e.m.j.JvmGcMonitorService] [node-SRV2] [gc][661032] overhead, spent [332ms] collecting in the last [1s]
[2020-11-25T13:01:22,877][ERROR][o.e.a.b.TransportBulkAction] [node-SRV2] failed to execute pipeline for a bulk request
org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution of org.elasticsearch.ingest.IngestService$4@3b16a82f on EsThreadPoolExecutor[name = node-SRV2/write, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@39a16dfd[Running, pool size = 16, active threads = 16, queued tasks = 200, completed tasks = 34125052]]
at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:48) ~[elasticsearch-6.5.4.jar:6.5.4]
at java.util.concurrent.ThreadPoolExecutor.reject(Unknown Source) ~[?:1.8.0_152]
at java.util.concurrent.ThreadPoolExecutor.execute(Unknown Source) ~[?:1.8.0_152]
at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.doExecute(EsThreadPoolExecutor.java:98) ~[elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.execute(EsThreadPoolExecutor.java:93) ~[elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.ingest.IngestService.executeBulkRequest(IngestService.java:381) ~[elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.action.bulk.TransportBulkAction.processBulkIndexIngestRequest(TransportBulkAction.java:530) ~[elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.action.bulk.TransportBulkAction.executeIngestAndBulk(TransportBulkAction.java:235) ~[elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.action.bulk.TransportBulkAction.doExecute(TransportBulkAction.java:169) ~[elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.action.bulk.TransportBulkAction.doExecute(TransportBulkAction.java:89) ~[elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:167) ~[elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.xpack.security.action.filter.SecurityActionFilter.apply(SecurityActionFilter.java:126) ~[?:?]
at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:165) ~[elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:139) ~[elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:81) ~[elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.client.node.NodeClient.executeLocally(NodeClient.java:87) ~[elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.client.node.NodeClient.doExecute(NodeClient.java:76) ~[elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:395) ~[elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.client.support.AbstractClient.bulk(AbstractClient.java:472) ~[elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.xpack.core.ClientHelper.executeAsyncWithOrigin(ClientHelper.java:74) ~[x-pack-core-6.5.4.jar:6.5.4]
at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.doFlush(LocalBulk.java:108) ~[?:?]
at org.elasticsearch.xpack.monitoring.exporter.ExportBulk.flush(ExportBulk.java:60) ~[?:?]
at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$Compound.lambda$doFlush$1(ExportBulk.java:154) ~[?:?]
at org.elasticsearch.xpack.core.common.IteratingActionListener.run(IteratingActionListener.java:102) [x-pack-core-6.5.4.jar:6.5.4]
at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$Compound.doFlush(ExportBulk.java:170) [x-pack-monitoring-6.5.4.jar:6.5.4]
at org.elasticsearch.xpack.monitoring.exporter.ExportBulk.flushAndClose(ExportBulk.java:84) [x-pack-monitoring-6.5.4.jar:6.5.4]
at org.elasticsearch.xpack.monitoring.exporter.ExportBulk.close(ExportBulk.java:74) [x-pack-monitoring-6.5.4.jar:6.5.4]
at org.elasticsearch.xpack.monitoring.exporter.Exporters.export(Exporters.java:196) [x-pack-monitoring-6.5.4.jar:6.5.4]
at org.elasticsearch.xpack.monitoring.action.TransportMonitoringBulkAction$AsyncAction$1.doRun(TransportMonitoringBulkAction.java:147) [x-pack-monitoring-6.5.4.jar:6.5.4]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:723) [elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.5.4.jar:6.5.4]
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:1.8.0_152]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:1.8.0_152]
at java.lang.Thread.run(Unknown Source) [?:1.8.0_152]
............
[2020-11-25T13:02:37,800][ERROR][o.e.x.m.c.n.NodeStatsCollector] [node-SRV2] collector [node_stats] timed out when collecting data
[2020-11-25T13:02:47,816][ERROR][o.e.x.m.c.i.IndexStatsCollector] [SRV2] collector [index-stats] timed out when collecting data
[2020-11-25T13:03:24,097][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [SRV2] collector [cluster_stats] timed out when collecting data
[2020-11-25T13:03:57,878][ERROR][o.e.x.m.c.i.IndexRecoveryCollector] [SRV2] collector [index_recovery] timed out when collecting data
[2020-11-25T13:04:07,894][ERROR][o.e.x.m.c.n.NodeStatsCollector] [SRV2] collector [node_stats] timed out when collecting data
[2020-11-25T13:04:39,112][ERROR][o.e.x.m.c.n.NodeStatsCollector] [SRV2] collector [node_stats] timed out when collecting data
[2020-11-25T13:04:49,128][ERROR][o.e.x.m.c.i.IndexStatsCollector] [SRV2] collector [index-stats] timed out when collecting data
[2020-11-25T13:05:28,097][ERROR][o.e.x.m.c.n.NodeStatsCollector] [SRV2] collector [node_stats] timed out when collecting data
[2020-11-25T13:06:07,097][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [SRV2] collector [cluster_stats] timed out when collecting data
[2020-11-25T13:06:19,488][INFO ][o.e.m.j.JvmGcMonitorService] [SRV2] [gc][661324] overhead, spent [341ms] collecting in the last [1s]
[2020-11-25T13:26:09,624][ERROR][o.e.x.m.c.n.NodeStatsCollector] [SRV2] collector [node_stats] timed out when collecting data
SRV3:
[2020-11-25T11:17:13,843][INFO ][o.e.m.j.JvmGcMonitorService] [SRV3] [gc][17827888] overhead, spent [295ms] collecting in the last [1s]
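For what it's worth, the `EsRejectedExecutionException` on SRV2 encodes the write pool's state at the moment of rejection (queue capacity = 200, pool size = 16, active threads = 16, queued tasks = 200), i.e. the write thread pool was fully saturated. A small script I use to pull those numbers out of our log files (the regex is written against the exact message format in the trace above):

```python
import re

# Matches the pool-state fields inside an EsRejectedExecutionException log
# line, e.g. "... queue capacity = 200, ... [Running, pool size = 16,
# active threads = 16, queued tasks = 200, completed tasks = 34125052]]"
POOL_RE = re.compile(
    r"queue capacity = (?P<capacity>\d+).*?"
    r"pool size = (?P<pool>\d+), "
    r"active threads = (?P<active>\d+), "
    r"queued tasks = (?P<queued>\d+)"
)

def pool_state(line):
    """Return the executor's pool state from a rejection log line, or None."""
    m = POOL_RE.search(line)
    if not m:
        return None
    return {k: int(v) for k, v in m.groupdict().items()}

line = (
    "org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: "
    "rejected execution of org.elasticsearch.ingest.IngestService$4@3b16a82f "
    "on EsThreadPoolExecutor[name = node-SRV2/write, queue capacity = 200, "
    "org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@39a16dfd"
    "[Running, pool size = 16, active threads = 16, queued tasks = 200, "
    "completed tasks = 34125052]]"
)
state = pool_state(line)
# A full queue (queued == capacity) with every thread busy means the write
# pool is saturated and further bulk requests get rejected.
saturated = state["queued"] >= state["capacity"] and state["active"] == state["pool"]
```

So during the incident the write pool on SRV2 was saturated, which lines up with the indexing latency and the dashboards falling behind.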