Elasticsearch cluster request timeouts and slow response times

Hi,

We are having some problems with our Elasticsearch cluster. For the past few days, the requests performed by the master node occasionally start to time out, and after that every search query against Elasticsearch also times out and everything becomes very slow.
The cluster worked very well until last week and we haven't made any changes to it. The only thing we did was restart every node (successfully) after a router failure.
It doesn't seem like a network issue, because every node can successfully establish a connection to the other nodes over port 9300.
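
For reference, this is the kind of check we mean; a minimal sketch over the transport port, where the node IPs other than 10.4.0.13 (which appears in the logs) are placeholders:

    # Minimal reachability check over the transport port (9300); a rough sketch.
    # Only 10.4.0.13 appears in the logs below; the other addresses are placeholders.
    import socket

    NODES = ["10.4.0.10", "10.4.0.11", "10.4.0.12", "10.4.0.13"]
    TRANSPORT_PORT = 9300

    for host in NODES:
        try:
            with socket.create_connection((host, TRANSPORT_PORT), timeout=5):
                print(f"{host}:{TRANSPORT_PORT} reachable")
        except OSError as exc:
            print(f"{host}:{TRANSPORT_PORT} NOT reachable: {exc}")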

Here are some examples of the timeout logs from one of the nodes:

[2021-02-02T16:28:08,449][WARN ][o.e.x.w.e.ExecutionService] [node-warm-01] failed to execute watch [MbZd5nRsQMCTfOguDES1dQ_xpack_license_expiration]
[2021-02-02T16:28:12,002][WARN ][o.e.x.w.e.ExecutionService] [node-warm-01] failed to execute watch [MbZd5nRsQMCTfOguDES1dQ_kibana_version_mismatch]
[2021-02-02T16:28:39,460][WARN ][o.e.x.w.e.ExecutionService] [node-warm-01] failed to execute watch [MbZd5nRsQMCTfOguDES1dQ_elasticsearch_nodes]
[2021-02-02T16:29:09,975][WARN ][o.e.x.w.e.ExecutionService] [node-warm-01] failed to execute watch [MbZd5nRsQMCTfOguDES1dQ_elasticsearch_nodes]
[2021-02-02T16:31:40,012][WARN ][o.e.x.w.e.ExecutionService] [node-warm-01] failed to execute watch [MbZd5nRsQMCTfOguDES1dQ_elasticsearch_nodes]
[2021-02-02T16:35:03,777][WARN ][o.e.x.w.e.ExecutionService] [node-warm-01] failed to execute watch [MbZd5nRsQMCTfOguDES1dQ_elasticsearch_nodes]

Also, some logs from the master node:

[2021-02-02T16:42:29,063][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [node-hot-02] failed to execute on node [mIKSJxiwQIWztvvbxdetIA]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [node-warm-01][10.4.0.13:9300][cluster:monitor/nodes/stats[n]] request_id [31182042] timed out after [15027ms]
        at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1022) [elasticsearch-7.4.0.jar:7.4.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:703) [elasticsearch-7.4.0.jar:7.4.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:830) [?:?]
[2021-02-02T16:42:39,070][WARN ][o.e.x.w.e.ExecutionService] [node-hot-02] failed to execute watch [MbZd5nRsQMCTfOguDES1dQ_elasticsearch_cluster_status]
[2021-02-02T16:42:39,309][WARN ][o.e.x.w.e.ExecutionService] [node-hot-02] failed to execute watch [MbZd5nRsQMCTfOguDES1dQ_logstash_version_mismatch]
[2021-02-02T16:42:39,310][WARN ][o.e.x.w.e.ExecutionService] [node-hot-02] failed to execute watch [MbZd5nRsQMCTfOguDES1dQ_elasticsearch_version_mismatch]
[2021-02-02T16:42:44,064][WARN ][o.e.c.InternalClusterInfoService] [node-hot-02] Failed to update shard information for ClusterInfoUpdateJob within 15s timeout
[2021-02-02T16:42:57,241][WARN ][o.e.t.TransportService   ] [node-hot-02] Received response for a request that has timed out, sent [43154ms] ago, timed out [28127ms] ago, action [cluster:monitor/nodes/stats[n]], node [{node-warm-01}{mIKSJxiwQIWztvvbxdetIA}{YqcLPekQRsyYb0loLQQpUw}{10.4.0.13}{10.4.0.13:9300}{dm}{xpack.installed=true, box_type=warm}], id [31182042]
[2021-02-02T16:42:57,247][INFO ][o.e.c.m.MetaDataIndexTemplateService] [node-hot-02] adding template [.management-beats] for index patterns [.management-beats]
[2021-02-02T16:43:17,935][ERROR][o.e.x.m.c.i.IndexRecoveryCollector] [node-hot-02] collector [index_recovery] timed out when collecting data
[2021-02-02T16:43:24,359][INFO ][o.e.x.s.a.AuthenticationService] [node-hot-02] Authentication of [elastic] was terminated by realm [reserved] - failed to authenticate user [elastic]
[2021-02-02T16:43:29,066][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [node-hot-02] failed to execute on node [mIKSJxiwQIWztvvbxdetIA]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [node-warm-01][10.4.0.13:9300][cluster:monitor/nodes/stats[n]] request_id [31216481] timed out after [15019ms]
        at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1022) [elasticsearch-7.4.0.jar:7.4.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:703) [elasticsearch-7.4.0.jar:7.4.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:830) [?:?]
[2021-02-02T16:43:39,688][WARN ][o.e.t.TransportService   ] [node-hot-02] Received response for a request that has timed out, sent [25631ms] ago, timed out [10612ms] ago, action [cluster:monitor/nodes/stats[n]], node [{node-warm-01}{mIKSJxiwQIWztvvbxdetIA}{YqcLPekQRsyYb0loLQQpUw}{10.4.0.13}{10.4.0.13:9300}{dm}{xpack.installed=true, box_type=warm}], id [31216481]
[2021-02-02T16:43:44,067][WARN ][o.e.c.InternalClusterInfoService] [node-hot-02] Failed to update shard information for ClusterInfoUpdateJob within 15s timeout
[2021-02-02T16:44:24,513][INFO ][o.e.x.s.a.AuthenticationService] [node-hot-02] Authentication of [elastic] was terminated by realm [reserved] - failed to authenticate user [elastic]
[2021-02-02T16:44:29,068][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [node-hot-02] failed to execute on node [mIKSJxiwQIWztvvbxdetIA]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [node-warm-01][10.4.0.13:9300][cluster:monitor/nodes/stats[n]] request_id [31244112] timed out after [15024ms]
        at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1022) [elasticsearch-7.4.0.jar:7.4.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:703) [elasticsearch-7.4.0.jar:7.4.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:830) [?:?]
[2021-02-02T16:44:38,778][WARN ][o.e.x.w.e.ExecutionService] [node-hot-02] failed to execute watch [MbZd5nRsQMCTfOguDES1dQ_elasticsearch_cluster_status]
[2021-02-02T16:44:44,070][WARN ][o.e.c.InternalClusterInfoService] [node-hot-02] Failed to update shard information for ClusterInfoUpdateJob within 15s timeout

We have 4 data nodes (version 7.4.0) in a hot-warm configuration, plus 1 ML node, also on 7.4.0.
Every node runs on its own physical host, each with a 30 GB heap and 50% or more of the RAM left to the operating system.
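
For reference, the per-node heap sizing can be confirmed through the node stats API, roughly as in the sketch below (the endpoint and credentials are placeholders):

    # Confirm the ~30 GB heap per node via the node stats API; a rough sketch.
    # Endpoint and credentials are placeholders.
    import requests

    ES = "http://localhost:9200"
    AUTH = ("elastic", "<password>")

    resp = requests.get(f"{ES}/_nodes/stats/jvm", params={"human": "true"},
                        auth=AUTH, timeout=30)
    resp.raise_for_status()
    for node in resp.json()["nodes"].values():
        mem = node["jvm"]["mem"]
        print(node["name"], mem["heap_used"], "of", mem["heap_max"])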

The cluster holds 1466 shards across 733 indices; each index has a single primary shard (plus one replica) and is rolled over at 50 GB.
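
For context, a setup like this is usually expressed as an ILM policy plus an index template, roughly as in the sketch below. The single primary shard, the 50 GB rollover threshold and the box_type node attribute match what is described above and in the logs; the policy/template names, index pattern, 7d warm age and endpoint are illustrative assumptions:

    # Sketch of the rollover setup: single-primary-shard indices, rolled over at 50 GB
    # and later allocated to the warm nodes. Names, pattern, age and endpoint are assumed.
    import requests

    ES = "http://localhost:9200"
    AUTH = ("elastic", "<password>")

    policy = {
        "policy": {
            "phases": {
                "hot": {"actions": {"rollover": {"max_size": "50gb"}}},
                "warm": {
                    "min_age": "7d",
                    "actions": {"allocate": {"require": {"box_type": "warm"}}},
                },
            }
        }
    }
    template = {
        "index_patterns": ["logs-*"],
        "settings": {
            "number_of_shards": 1,
            "number_of_replicas": 1,
            "index.lifecycle.name": "logs-rollover",
            "index.lifecycle.rollover_alias": "logs",
        },
    }

    requests.put(f"{ES}/_ilm/policy/logs-rollover", json=policy, auth=AUTH, timeout=30)
    requests.put(f"{ES}/_template/logs", json=template, auth=AUTH, timeout=30)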

I am attaching the cluster stats and a screenshot from Kibana that shows the gaps in the monitoring graphs caused by the timeouts.
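
For reference, the stats were pulled from the cluster stats API, roughly as in the sketch below (the endpoint and credentials are placeholders):

    # Dump the cluster stats shown underneath; a minimal sketch.
    # Endpoint and credentials are placeholders; "human" adds the readable size fields.
    import json

    import requests

    ES = "http://localhost:9200"
    AUTH = ("elastic", "<password>")

    resp = requests.get(f"{ES}/_cluster/stats", params={"human": "true"},
                        auth=AUTH, timeout=30)
    resp.raise_for_status()
    print(json.dumps(resp.json(), indent=2))

The output: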

    {
      "_nodes" : {
        "total" : 5,
        "successful" : 5,
        "failed" : 0
      },
      "cluster_name" : "ms-production",
      "cluster_uuid" : "MbZd5nRsQMCTfOguDES1dQ",
      "timestamp" : 1612278348648,
      "status" : "green",
      "indices" : {
        "count" : 733,
        "shards" : {
          "total" : 1466,
          "primaries" : 733,
          "replication" : 1.0,
          "index" : {
            "shards" : {
              "min" : 2,
              "max" : 2,
              "avg" : 2.0
            },
            "primaries" : {
              "min" : 1,
              "max" : 1,
              "avg" : 1.0
            },
            "replication" : {
              "min" : 1.0,
              "max" : 1.0,
              "avg" : 1.0
            }
          }
        },
        "docs" : {
          "count" : 1645587175,
          "deleted" : 20435917
        },
        "store" : {
          "size" : "3.4tb",
          "size_in_bytes" : 3821384849117
        },
        "fielddata" : {
          "memory_size" : "689.9kb",
          "memory_size_in_bytes" : 706512,
          "evictions" : 0
        },
        "query_cache" : {
          "memory_size" : "434.6mb",
          "memory_size_in_bytes" : 455712275,
          "total_count" : 258821394,
          "hit_count" : 16127397,
          "miss_count" : 242693997,
          "cache_size" : 481276,
          "cache_count" : 669959,
          "evictions" : 188683
        },
        "completion" : {
          "size" : "0b",
          "size_in_bytes" : 0
        },
        "segments" : {
          "count" : 9316,
          "memory" : "2.6gb",
          "memory_in_bytes" : 2898584427,
          "terms_memory" : "922.9mb",
          "terms_memory_in_bytes" : 967800836,
          "stored_fields_memory" : "1.1gb",
          "stored_fields_memory_in_bytes" : 1265345552,
          "term_vectors_memory" : "0b",
          "term_vectors_memory_in_bytes" : 0,
          "norms_memory" : "12.8mb",
          "norms_memory_in_bytes" : 13512832,
          "points_memory" : "563.7mb",
          "points_memory_in_bytes" : 591143621,
          "doc_values_memory" : "57.9mb",
          "doc_values_memory_in_bytes" : 60781586,
          "index_writer_memory" : "370.6mb",
          "index_writer_memory_in_bytes" : 388685446,
          "version_map_memory" : "11.8mb",
          "version_map_memory_in_bytes" : 12473811,
          "fixed_bit_set" : "17.8mb",
          "fixed_bit_set_memory_in_bytes" : 18714720,
          "max_unsafe_auto_id_timestamp" : 1612265701050,
          "file_sizes" : { }
        }
      },
      "nodes" : {
        "count" : {
          "total" : 5,
          "coordinating_only" : 0,
          "data" : 4,
          "ingest" : 0,
          "master" : 3,
          "ml" : 1,
          "voting_only" : 0
        },
        "versions" : [
          "7.4.0"
        ],
        "os" : {
          "available_processors" : 272,
          "allocated_processors" : 272,
          "names" : [
            {
              "name" : "Linux",
              "count" : 5
            }
          ],
          "pretty_names" : [
            {
              "pretty_name" : "CentOS Linux 7 (Core)",
              "count" : 5
            }
          ],
          "mem" : {
            "total" : "500.7gb",
            "total_in_bytes" : 537640677376,
            "free" : "10.1gb",
            "free_in_bytes" : 10909540352,
            "used" : "490.5gb",
            "used_in_bytes" : 526731137024,
            "free_percent" : 2,
            "used_percent" : 98
          }
        },
        "process" : {
          "cpu" : {
            "percent" : 2
          },
          "open_file_descriptors" : {
            "min" : 898,
            "max" : 7895,
            "avg" : 6199
          }
        },
        "jvm" : {
          "max_uptime" : "4.2d",
          "max_uptime_in_millis" : 364499181,
          "versions" : [
            {
              "version" : "13",
              "vm_name" : "OpenJDK 64-Bit Server VM",
              "vm_version" : "13+33",
              "vm_vendor" : "AdoptOpenJDK",
              "bundled_jdk" : true,
              "using_bundled_jdk" : true,
              "count" : 5
            }
          ],
          "mem" : {
            "heap_used" : "63.6gb",
            "heap_used_in_bytes" : 68395717384,
            "heap_max" : "150gb",
            "heap_max_in_bytes" : 161061273600
          },
          "threads" : 2635
        },
        "fs" : {
          "total" : "23.3tb",
          "total_in_bytes" : 25721315151872,
          "free" : "19.7tb",
          "free_in_bytes" : 21740976001024,
          "available" : "18.5tb",
          "available_in_bytes" : 20441425334272
        },
        "plugins" : [ ],
        "network_types" : {
          "transport_types" : {
            "security4" : 5
          },
          "http_types" : {
            "security4" : 5
          }
        },
        "discovery_types" : {
          "zen" : 5
        },
        "packaging_types" : [
          {
            "flavor" : "default",
            "type" : "rpm",
            "count" : 5
          }
        ]
      }
    }

Thanks
