Elastic data nodes are randomly getting unresponsive

fschoell · December 14, 2021, 12:21pm

Hello, we run an Elasticsearch cluster with 12 data nodes. Groups of 4 share a larger dom server and run in Linux containers (the dom servers currently have 48-core CPUs, 1.5 TB of memory, and large SSD arrays). We are increasingly seeing some of these data nodes become unresponsive, meaning that API requests that collect node stats (such as GET _cluster/stats or GET _nodes/_all) time out. This often (but not always) affects data ingestion, and our Filebeat instances may fall behind on indexing our nginx logs. GET _cat/health?v keeps reporting the cluster as healthy. Impacted nodes also show significantly higher CPU usage than those not impacted, but CPU usage drops after the Elasticsearch service has been restarted.
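Since cluster-wide calls wait on every node, one way to see which nodes are stalling is to probe each node separately with a short client-side timeout. The following is only a minimal sketch under assumptions: the host names, HTTP on port 9200 without authentication, and the helper names `fetch_local_stats` and `find_unresponsive` are all hypothetical. `GET _nodes/_local/stats` restricts the node stats request to the node that receives it.

```python
import time
import urllib.request
from typing import Callable, Dict, Iterable, Optional


def fetch_local_stats(host: str, timeout: float) -> bytes:
    """Fetch stats only for the node that receives the request.

    Assumption: plain HTTP on port 9200 and no authentication.
    """
    url = f"http://{host}:9200/_nodes/_local/stats"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def find_unresponsive(
    hosts: Iterable[str],
    fetch: Callable[[str, float], bytes] = fetch_local_stats,
    timeout: float = 5.0,
) -> Dict[str, Optional[float]]:
    """Map each host to its response time in seconds, or None if the probe failed."""
    results: Dict[str, Optional[float]] = {}
    for host in hosts:
        start = time.monotonic()
        try:
            fetch(host, timeout)
        except OSError:
            # urllib's URLError and socket timeouts are both OSError subclasses.
            results[host] = None
        else:
            results[host] = time.monotonic() - start
    return results


if __name__ == "__main__":
    # Hypothetical host names; replace with the addresses of the data nodes.
    for host, elapsed in find_unresponsive(["data-node-a", "data-node-b"]).items():
        status = "no response" if elapsed is None else f"{elapsed:.2f}s"
        print(f"{host}: {status}")
```

Running this on a schedule and recording which hosts return `None` would show whether the same containers, or the same dom server, are affected each time.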

Looking at the logs of the affected data nodes, it seems to be an issue with the node stats collector:

[2021-12-14T11:36:18,219][ERROR][o.e.x.m.c.n.NodeStatsCollector] [esh1-data13c] collector [node_stats] timed out when collecting data: node [oTyaaQiuQk-7osC9yirMwQ] did not respond within [30s]
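To see whether the timeouts cluster around particular nodes, lines like the one above can be tallied by the node ID they name. This is a rough sketch that assumes every collector timeout line follows exactly the layout shown above; the regular expression and the names `CollectorTimeout`, `parse_timeout_line`, and `count_timeouts` are illustrative, not part of Elasticsearch.

```python
import re
from collections import Counter
from typing import Iterable, NamedTuple, Optional


class CollectorTimeout(NamedTuple):
    timestamp: str
    reporting_node: str   # node that wrote the log line
    collector: str
    target_node_id: str   # node that did not respond
    timeout: str


# Assumption: the layout matches the single log line quoted in this post.
PATTERN = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\]"
    r"\[ERROR\s*\]"
    r"\[[^\]]+\]\s+"
    r"\[(?P<reporting_node>[^\]]+)\]\s+"
    r"collector \[(?P<collector>[^\]]+)\] timed out when collecting data: "
    r"node \[(?P<target_node_id>[^\]]+)\] did not respond within "
    r"\[(?P<timeout>[^\]]+)\]"
)


def parse_timeout_line(line: str) -> Optional[CollectorTimeout]:
    """Return the parsed fields, or None if the line is not a collector timeout."""
    match = PATTERN.match(line.strip())
    return CollectorTimeout(**match.groupdict()) if match else None


def count_timeouts(lines: Iterable[str]) -> Counter:
    """Count collector timeouts per target node ID."""
    parsed = (parse_timeout_line(line) for line in lines)
    return Counter(p.target_node_id for p in parsed if p is not None)


if __name__ == "__main__":
    sample = (
        "[2021-12-14T11:36:18,219][ERROR][o.e.x.m.c.n.NodeStatsCollector] "
        "[esh1-data13c] collector [node_stats] timed out when collecting data: "
        "node [oTyaaQiuQk-7osC9yirMwQ] did not respond within [30s]"
    )
    print(count_timeouts([sample]))
```

The node IDs can be mapped back to node names with something like GET _cat/nodes?v&full_id=true&h=id,name.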

I've created a Gist with different results from the Elastic API, including:

GET _cluster/stats

GET _nodes/hot_threads?threads=9999 (in that response, the two impacted data nodes are esh1-data13b and esh1-data13c)

Any idea where this issue might come from?

system · January 11, 2022, 12:21pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
