Elastic data nodes randomly become unresponsive

Hello, we run an Elasticsearch cluster with 12 data nodes. Groups of 4 nodes share a larger dom server and run in Linux containers (the dom servers currently have 48-core CPUs, 1.5 TB of memory, and large SSD arrays). We are increasingly seeing some of these data nodes become unresponsive, meaning that API requests collecting node stats time out (such as GET _cluster/stats or GET _nodes/_all). This often (but not always) impacts data ingestion, and our Filebeat instances can fall behind indexing our nginx logs. GET _cat/health?v keeps reporting the cluster as healthy. Impacted nodes also show significantly higher CPU usage than nodes that are not impacted, but CPU drops back down after the Elasticsearch service is restarted.

Looking at the logs of the affected data nodes, it appears to be an issue with the node stats collector:

[2021-12-14T11:36:18,219][ERROR][o.e.x.m.c.n.NodeStatsCollector] [esh1-data13c] collector [node_stats] timed out when collecting data: node [oTyaaQiuQk-7osC9yirMwQ] did not respond within [30s]
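When scanning many log files for these timeouts, a small parser can pull out which node id failed to respond and the configured timeout. This is a hypothetical helper, not part of Elasticsearch; the regex is written against the log line format shown above:

```python
import re

# Matches the NodeStatsCollector timeout message emitted by the
# x-pack monitoring collector, as seen in the log line above.
LOG_PATTERN = re.compile(
    r"collector \[node_stats\] timed out when collecting data: "
    r"node \[(?P<node_id>[^\]]+)\] did not respond within \[(?P<timeout>[^\]]+)\]"
)

def parse_timeout_line(line):
    """Return (node_id, timeout) for a node_stats timeout line, else None."""
    m = LOG_PATTERN.search(line)
    if m is None:
        return None
    return m.group("node_id"), m.group("timeout")

line = ("[2021-12-14T11:36:18,219][ERROR][o.e.x.m.c.n.NodeStatsCollector] "
        "[esh1-data13c] collector [node_stats] timed out when collecting data: "
        "node [oTyaaQiuQk-7osC9yirMwQ] did not respond within [30s]")
print(parse_timeout_line(line))  # ('oTyaaQiuQk-7osC9yirMwQ', '30s')
```

Tallying the extracted node ids over a day of logs makes it easy to see whether the timeouts always hit the same nodes or move around the cluster.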

I've created a Gist with the results of several Elastic API calls, including:

GET _cluster/stats

GET _nodes/hot_threads?threads=9999 (in that output, the two impacted data nodes are esh1-data13b and esh1-data13c)
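Since the impacted nodes show elevated CPU, it can help to flag them programmatically from a GET _nodes/stats response rather than eyeballing the JSON. The sketch below assumes the standard response layout, where per-node OS CPU sits under nodes.&lt;id&gt;.os.cpu.percent; the sample response is fabricated for illustration:

```python
def busy_nodes(node_stats, cpu_threshold=80):
    """Return names of nodes whose OS CPU usage exceeds cpu_threshold percent."""
    hot = []
    for node in node_stats.get("nodes", {}).values():
        cpu = node.get("os", {}).get("cpu", {}).get("percent")
        if cpu is not None and cpu > cpu_threshold:
            hot.append(node.get("name"))
    return hot

# Fabricated sample of a GET _nodes/stats response, trimmed to the
# fields this helper reads (node name and os.cpu.percent).
sample = {
    "nodes": {
        "aaa": {"name": "esh1-data13b", "os": {"cpu": {"percent": 97}}},
        "bbb": {"name": "esh1-data13c", "os": {"cpu": {"percent": 92}}},
        "ccc": {"name": "esh1-data01a", "os": {"cpu": {"percent": 15}}},
    }
}
print(busy_nodes(sample))  # ['esh1-data13b', 'esh1-data13c']
```

Fed with live stats, this would make it easy to check whether the high-CPU nodes are always the same ones that trigger the node_stats timeouts.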

Any idea where this issue might come from?
