High refresh time in cluster nodes

Hi all,
I have a three-node Elasticsearch 6.5.4 cluster. Sometimes the data shown in Visualize and Dashboard falls behind, and it takes hours to catch up again. According to the Elasticsearch monitoring dashboard, index latency increases dramatically in these situations, and "Time spent performing refresh" also rises significantly.
The nodes are named SRV1, SRV2 and SRV3; their logs from around the time the incident starts are shown below. Note that SRV2 is the master.
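
For context, this is roughly how I check refresh time and write-queue pressure while the lag is happening. It is only a sketch in Python: it assumes the HTTP API is reachable on port 9200 with no authentication (the host below is SRV2's transport address from the logs), so adjust host and credentials to your own setup.

    # Rough sketch of the checks I run when the dashboards start lagging.
    # Assumptions (not taken from the logs): HTTP on port 9200, no auth.
    import requests

    ES = "http://10.0.54.11:9200"  # SRV2; the HTTP port is an assumption

    # Per-node refresh stats: total refresh count and total time spent refreshing.
    stats = requests.get(ES + "/_nodes/stats/indices/refresh", timeout=10).json()
    for node in stats["nodes"].values():
        refresh = node["indices"]["refresh"]
        print(node["name"], refresh["total"], refresh["total_time_in_millis"], "ms")

    # Write thread pool pressure: active threads, queue length and rejections per node.
    tp = requests.get(
        ES + "/_cat/thread_pool/write",
        params={"v": "true", "h": "node_name,active,queue,rejected"},
        timeout=10,
    )
    print(tp.text)
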
SRV1:
    [2020-11-25T11:32:34,309][ERROR][o.e.x.m.c.n.NodeStatsCollector] [node-SRV1] collector [node_stats] timed out when collecting data

SRV2:

    [2020-11-25T11:32:25,050][ERROR][o.e.x.m.c.n.NodeStatsCollector] [node-SRV2] collector [node_stats] timed out when collecting data
    [2020-11-25T11:32:35,066][ERROR][o.e.x.m.c.i.IndexStatsCollector] [node-SRV2] collector [index-stats] timed out when collecting data
    [2020-11-25T11:34:50,771][ERROR][o.e.x.m.c.n.NodeStatsCollector] [node-SRV2] collector [node_stats] timed out when collecting data
    [2020-11-25T11:36:11,288][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [node-SRV2] failed to execute on node [6U240ajMSruuV3OR15F9hQ]
    org.elasticsearch.transport.ReceiveTimeoutTransportException: [node-SRV2][10.0.54.11:9300][cluster:monitor/nodes/stats[n]] request_id [160653209] timed out after [14829ms]
    	at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1038) [elasticsearch-6.5.4.jar:6.5.4]
    	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:624) [elasticsearch-6.5.4.jar:6.5.4]
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:1.8.0_152]
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:1.8.0_152]
    	at java.lang.Thread.run(Unknown Source) [?:1.8.0_152]
    [2020-11-25T11:36:12,053][WARN ][o.e.t.TransportService   ] [node-SRV2] Received response for a request that has timed out, sent [15641ms] ago, timed out [812ms] ago, action [cluster:monitor/nodes/stats[n]], node [{node-SRV2}{6U240ajMSruuV3OR15F9hQ}{qR87yY7RTsS8TBl6dBao1A}{10.0.54.11}{10.0.54.11:9300}{ml.machine_memory=17179262976, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}], id [160653209]
    [2020-11-25T11:40:11,262][ERROR][o.e.x.m.c.n.NodeStatsCollector] [node-SRV2] collector [node_stats] timed out when collecting data
    [2020-11-25T11:43:13,625][ERROR][o.e.x.m.c.i.IndexStatsCollector] [node-SRV2] collector [index-stats] timed out when collecting data
    [2020-11-25T11:44:15,909][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [node-SRV2] collector [cluster_stats] timed out when collecting data
    [2020-11-25T11:44:31,519][ERROR][o.e.x.m.c.n.NodeStatsCollector] [node-SRV2] collector [node_stats] timed out when collecting data
    [2020-11-25T11:45:11,535][ERROR][o.e.x.m.c.i.IndexRecoveryCollector] [node-SRV2] collector [index_recovery] timed out when collecting data
    [2020-11-25T11:45:21,551][ERROR][o.e.x.m.c.n.NodeStatsCollector] [node-SRV2] collector [node_stats] timed out when collecting data
    [2020-11-25T11:49:01,820][ERROR][o.e.x.m.c.i.IndexRecoveryCollector] [node-SRV2] collector [index_recovery] timed out when collecting data
    [2020-11-25T11:49:11,836][ERROR][o.e.x.m.c.n.NodeStatsCollector] [node-SRV2] collector [node_stats] timed out when collecting data
    [2020-11-25T11:49:21,851][ERROR][o.e.x.m.c.i.IndexStatsCollector] [node-SRV2] collector [index-stats] timed out when collecting data
    [2020-11-25T11:49:52,664][ERROR][o.e.x.m.c.n.NodeStatsCollector] [node-SRV2] collector [node_stats] timed out when collecting data
    [2020-11-25T11:50:32,977][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [node-SRV2] collector [cluster_stats] timed out when collecting data
    [2020-11-25T11:51:12,009][ERROR][o.e.x.m.c.n.NodeStatsCollector] [node-SRV2] collector [node_stats] timed out when collecting data
    [2020-11-25T11:52:12,323][ERROR][o.e.x.m.c.n.NodeStatsCollector] [node-SRV2] collector [node_stats] timed out when collecting data
    [2020-11-25T12:40:56,224][ERROR][o.e.x.m.c.n.NodeStatsCollector] [node-SRV2] collector [node_stats] timed out when collecting data
    [2020-11-25T12:42:00,662][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [node-SRV2] collector [cluster_stats] timed out when collecting data
    [2020-11-25T12:45:30,978][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [node-SRV2] collector [cluster_stats] timed out when collecting data
    [2020-11-25T12:46:06,511][ERROR][o.e.x.m.c.n.NodeStatsCollector] [node-SRV2] collector [node_stats] timed out when collecting data
    [2020-11-25T12:49:18,235][ERROR][o.e.x.m.c.n.NodeStatsCollector] [node-SRV2] collector [node_stats] timed out when collecting data
    [2020-11-25T12:56:47,387][ERROR][o.e.x.m.c.i.IndexRecoveryCollector] [node-SRV2] collector [index_recovery] timed out when collecting data
    [2020-11-25T12:57:17,418][ERROR][o.e.x.m.c.i.IndexRecoveryCollector] [node-SRV2] collector [index_recovery] timed out when collecting data
    [2020-11-25T12:58:43,436][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [node-SRV2] collector [cluster_stats] timed out when collecting data
    [2020-11-25T12:59:57,594][ERROR][o.e.x.m.c.i.IndexRecoveryCollector] [node-SRV2] collector [index_recovery] timed out when collecting data
    [2020-11-25T13:00:07,610][ERROR][o.e.x.m.c.n.NodeStatsCollector] [node-SRV2] collector [node_stats] timed out when collecting data
    [2020-11-25T13:00:17,626][ERROR][o.e.x.m.c.i.IndexStatsCollector] [node-SRV2] collector [index-stats] timed out when collecting data
    [2020-11-25T13:00:27,641][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [node-SRV2] collector [cluster_stats] timed out when collecting data
    [2020-11-25T13:00:57,626][ERROR][o.e.x.m.c.i.IndexRecoveryCollector] [node-SRV2] collector [index_recovery] timed out when collecting data
    [2020-11-25T13:01:07,642][ERROR][o.e.x.m.c.n.NodeStatsCollector] [node-SRV2] collector [node_stats] timed out when collecting data
    [2020-11-25T13:01:17,658][ERROR][o.e.x.m.c.i.IndexStatsCollector] [node-SRV2] collector [index-stats] timed out when collecting data
    [2020-11-25T13:01:23,330][INFO ][o.e.m.j.JvmGcMonitorService] [node-SRV2] [gc][661032] overhead, spent [332ms] collecting in the last [1s]
    [2020-11-25T13:01:22,877][ERROR][o.e.a.b.TransportBulkAction] [node-SRV2] failed to execute pipeline for a bulk request
    org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution of org.elasticsearch.ingest.IngestService$4@3b16a82f on EsThreadPoolExecutor[name = node-SRV2/write, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@39a16dfd[Running, pool size = 16, active threads = 16, queued tasks = 200, completed tasks = 34125052]]
    	at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:48) ~[elasticsearch-6.5.4.jar:6.5.4]
    	at java.util.concurrent.ThreadPoolExecutor.reject(Unknown Source) ~[?:1.8.0_152]
    	at java.util.concurrent.ThreadPoolExecutor.execute(Unknown Source) ~[?:1.8.0_152]
    	at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.doExecute(EsThreadPoolExecutor.java:98) ~[elasticsearch-6.5.4.jar:6.5.4]
    	at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.execute(EsThreadPoolExecutor.java:93) ~[elasticsearch-6.5.4.jar:6.5.4]
    	at org.elasticsearch.ingest.IngestService.executeBulkRequest(IngestService.java:381) ~[elasticsearch-6.5.4.jar:6.5.4]
    	at org.elasticsearch.action.bulk.TransportBulkAction.processBulkIndexIngestRequest(TransportBulkAction.java:530) ~[elasticsearch-6.5.4.jar:6.5.4]
    	at org.elasticsearch.action.bulk.TransportBulkAction.executeIngestAndBulk(TransportBulkAction.java:235) ~[elasticsearch-6.5.4.jar:6.5.4]
    	at org.elasticsearch.action.bulk.TransportBulkAction.doExecute(TransportBulkAction.java:169) ~[elasticsearch-6.5.4.jar:6.5.4]
    	at org.elasticsearch.action.bulk.TransportBulkAction.doExecute(TransportBulkAction.java:89) ~[elasticsearch-6.5.4.jar:6.5.4]
    	at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:167) ~[elasticsearch-6.5.4.jar:6.5.4]
    	at org.elasticsearch.xpack.security.action.filter.SecurityActionFilter.apply(SecurityActionFilter.java:126) ~[?:?]
    	at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:165) ~[elasticsearch-6.5.4.jar:6.5.4]
    	at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:139) ~[elasticsearch-6.5.4.jar:6.5.4]
    	at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:81) ~[elasticsearch-6.5.4.jar:6.5.4]
    	at org.elasticsearch.client.node.NodeClient.executeLocally(NodeClient.java:87) ~[elasticsearch-6.5.4.jar:6.5.4]
    	at org.elasticsearch.client.node.NodeClient.doExecute(NodeClient.java:76) ~[elasticsearch-6.5.4.jar:6.5.4]
    	at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:395) ~[elasticsearch-6.5.4.jar:6.5.4]
    	at org.elasticsearch.client.support.AbstractClient.bulk(AbstractClient.java:472) ~[elasticsearch-6.5.4.jar:6.5.4]
    	at org.elasticsearch.xpack.core.ClientHelper.executeAsyncWithOrigin(ClientHelper.java:74) ~[x-pack-core-6.5.4.jar:6.5.4]
    	at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.doFlush(LocalBulk.java:108) ~[?:?]
    	at org.elasticsearch.xpack.monitoring.exporter.ExportBulk.flush(ExportBulk.java:60) ~[?:?]
    	at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$Compound.lambda$doFlush$1(ExportBulk.java:154) ~[?:?]
    	at org.elasticsearch.xpack.core.common.IteratingActionListener.run(IteratingActionListener.java:102) [x-pack-core-6.5.4.jar:6.5.4]
    	at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$Compound.doFlush(ExportBulk.java:170) [x-pack-monitoring-6.5.4.jar:6.5.4]
    	at org.elasticsearch.xpack.monitoring.exporter.ExportBulk.flushAndClose(ExportBulk.java:84) [x-pack-monitoring-6.5.4.jar:6.5.4]
    	at org.elasticsearch.xpack.monitoring.exporter.ExportBulk.close(ExportBulk.java:74) [x-pack-monitoring-6.5.4.jar:6.5.4]
    	at org.elasticsearch.xpack.monitoring.exporter.Exporters.export(Exporters.java:196) [x-pack-monitoring-6.5.4.jar:6.5.4]
    	at org.elasticsearch.xpack.monitoring.action.TransportMonitoringBulkAction$AsyncAction$1.doRun(TransportMonitoringBulkAction.java:147) [x-pack-monitoring-6.5.4.jar:6.5.4]
    	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:723) [elasticsearch-6.5.4.jar:6.5.4]
    	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.5.4.jar:6.5.4]
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:1.8.0_152]
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:1.8.0_152]
    	at java.lang.Thread.run(Unknown Source) [?:1.8.0_152]
    ............
    [2020-11-25T13:02:37,800][ERROR][o.e.x.m.c.n.NodeStatsCollector] [node-SRV2] collector [node_stats] timed out when collecting data
    [2020-11-25T13:02:47,816][ERROR][o.e.x.m.c.i.IndexStatsCollector] [SRV2] collector [index-stats] timed out when collecting data
    [2020-11-25T13:03:24,097][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [SRV2] collector [cluster_stats] timed out when collecting data
    [2020-11-25T13:03:57,878][ERROR][o.e.x.m.c.i.IndexRecoveryCollector] [SRV2] collector [index_recovery] timed out when collecting data
    [2020-11-25T13:04:07,894][ERROR][o.e.x.m.c.n.NodeStatsCollector] [SRV2] collector [node_stats] timed out when collecting data
    [2020-11-25T13:04:39,112][ERROR][o.e.x.m.c.n.NodeStatsCollector] [SRV2] collector [node_stats] timed out when collecting data
    [2020-11-25T13:04:49,128][ERROR][o.e.x.m.c.i.IndexStatsCollector] [SRV2] collector [index-stats] timed out when collecting data
    [2020-11-25T13:05:28,097][ERROR][o.e.x.m.c.n.NodeStatsCollector] [SRV2] collector [node_stats] timed out when collecting data
    [2020-11-25T13:06:07,097][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [SRV2] collector [cluster_stats] timed out when collecting data
    [2020-11-25T13:06:19,488][INFO ][o.e.m.j.JvmGcMonitorService] [SRV2] [gc][661324] overhead, spent [341ms] collecting in the last [1s]
    [2020-11-25T13:26:09,624][ERROR][o.e.x.m.c.n.NodeStatsCollector] [SRV2] collector [node_stats] timed out when collecting data

SRV3:

    [2020-11-25T11:17:13,843][INFO ][o.e.m.j.JvmGcMonitorService] [SRV3] [gc][17827888] overhead, spent [295ms] collecting in the last [1s]
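
Around the time the slowdown begins, SRV2 also logs the EsRejectedExecutionException shown above, i.e. the write thread pool (16 threads, queue capacity 200) is completely full. The next time it happens I plan to capture hot threads to see what those write/refresh threads are busy with, roughly like this (same assumptions as the sketch above; host and parameter values are placeholders to adjust):

    # Sketch: dump hot threads during the slowdown to see what the busy
    # write/refresh threads are doing. Same assumptions as above (HTTP on
    # port 9200, no auth); host and parameter values are placeholders.
    import requests

    resp = requests.get(
        "http://10.0.54.11:9200/_nodes/hot_threads",
        params={"threads": 5, "interval": "500ms"},
        timeout=30,
    )
    print(resp.text)  # plain-text report of the hottest threads on each node

Any pointers on what could cause the refresh time and indexing latency to spike like this would be appreciated.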
