Transport threads starved of CPU for long periods due to node-level scheduling/resource contention in an Elasticsearch cluster
Environment:
Elasticsearch version: v8.13.3
Deploy: tar.gz
Service management: systemd
The physical host is a dual-socket NUMA machine with 80 logical CPU cores and 380 GB of memory. It hosts two hot data nodes, one warm data node, and three Logstash instances; resource constraints leave no alternative to this co-located deployment. Based on per-process CPU monitoring of the nodes, the following allocation scheme was adopted:
1. CPU Allocation Strategy
A systemd isolation configuration is used to maximize performance by exploiting the NUMA architecture: each JVM's CPUs and memory are confined to the same NUMA node as far as possible, to reduce cross-node memory access latency.
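As a rough illustration, this is the kind of drop-in I mean, assuming a unit named elasticsearch-hot1.service and that NUMA node 0 owns logical cores 0-39 (the unit name, core range and memory cap are illustrative rather than the exact production values, and NUMAPolicy/NUMAMask require systemd >= 243):

```ini
# /etc/systemd/system/elasticsearch-hot1.service.d/numa.conf
[Service]
# Pin the JVM's threads to the logical cores of NUMA node 0.
CPUAffinity=0-39
# Prefer memory allocations from the same NUMA node.
NUMAPolicy=bind
NUMAMask=0
# Cap memory so the co-located processes cannot starve each other.
MemoryMax=120G
```

During the incident the elected master logged the following: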
[WARN ][o.e.c.InternalClusterInfoService] [xxxx] failed to retrieve shard stats from node [3BVEF_zmSceXcMw7Uv3Qyg] org.elasticsearch.transport.ReceiveTimeoutTransportException: [xxxx][xxxx][indices:monitor/stats[n]] request_id [1820674096] timed out after [15006ms]
[WARN ][o.e.c.InternalClusterInfoService] [xxxx] failed to retrieve shard stats from node [MVfdVFkoQqeimCtvkvgsjQ] org.elasticsearch.transport.ReceiveTimeoutTransportException: [xxxx][xxxx][indices:monitor/stats[n]] request_id [1820674098] timed out after [15006ms]
[ERROR][o.e.x.m.c.c.ClusterStatsCollector] [xxxx] collector [cluster_stats] timed out when collecting data: nodes [3BVEF_zmSceXcMw7Uv3Qyg, MVfdVFkoQqeimCtvkvgsjQ, aonbQRHEQxW4deheX-pWww] did not respond within [10s]
[INFO ][o.e.c.r.a.AllocationService] [xxxx] current.health="RED" message="Cluster health status changed from [GREEN] to [RED] (reason: [{xxxxxx}{MVfdVFkoQqeimCtvkvgsjQ}{cJGvljffQ-mQqFxBeOy7Zw}{xxxxxx}{xxxxxx}{xxxxxx}{hs}{8.13.3}{7000099-8503000} reason: followers check retry count exceeded [timeouts=3, failures=0]])." previous.health="GREEN" reason="{xxxxxx}{MVfdVFkoQqeimCtvkvgsjQ}{cJGvljffQ-mQqFxBeOy7Zw}{xxxxxx}{10.109.97.51}{xxxxxx}{hs}{8.13.3}{7000099-8503000} reason: followers check retry count exceeded [timeouts=3, failures=0]"
At the same time, the overall CPU utilization of the physical machine was about 28.8%, the 5-minute load average was about 63.58, and per-core usage sat between 10% and 50%. thread_pool.write.rejected and thread_pool.write.queue increased by about 10,000. Stack Monitoring shows the node's JVM was normal with no full GC, the network between the cluster nodes was fine, and the data disks are high-performance NVMe drives whose basic monitoring showed normal disk I/O.
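The checks above can be reproduced with the standard APIs; roughly the calls I mean (localhost:9200 and the plain, unauthenticated curl form are placeholders for the real endpoint and credentials):

```bash
# Per-node write thread pool pressure (active / queued / rejected tasks)
curl -s 'localhost:9200/_cat/thread_pool/write?v&h=node_name,active,queue,rejected'

# What the busiest threads on the affected node were doing at the time
curl -s 'localhost:9200/_nodes/<node-name>/hot_threads?threads=5'

# Cluster-level view of the health transition
curl -s 'localhost:9200/_cluster/health?pretty'
```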
About two minutes later the anomalous node rejoined the cluster; two minutes after that the cluster state changed to YELLOW, and after 16 minutes it changed to GREEN.
This was at peak logging time, so my working theory is: the multiple Elasticsearch processes on the physical machine are not isolated from each other --> a large volume of logs is written to the hot data nodes --> CPU scheduling delay --> the ES transport threads are starved --> the follower check retry count is exceeded --> the cluster state changes to RED. A quick way to check the scheduling-delay step is sketched below.
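One way to confirm or rule out the run-queue delay would be to sample the kernel's per-thread scheduling statistics for the hot node's JVM while the problem is happening; the second field of each schedstat file is cumulative time spent waiting on the run queue, in nanoseconds (the unit name used to look up the PID is hypothetical):

```bash
# Hypothetical unit name; substitute the real one.
PID=$(systemctl show -p MainPID --value elasticsearch-hot1.service)

# schedstat fields per thread: time_on_cpu_ns  time_waiting_on_runqueue_ns  nr_timeslices
# Take two samples 10 s apart; a large delta in the second field means the
# JVM's threads are runnable but not getting scheduled.
awk '{wait += $2} END {print wait, "ns waiting on runqueue"}' /proc/$PID/task/*/schedstat
sleep 10
awk '{wait += $2} END {print wait, "ns waiting on runqueue"}' /proc/$PID/task/*/schedstat
```

If the run-queue wait time grows much faster than actual CPU time during the spike, that points at host-level contention rather than anything inside Elasticsearch itself.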
Possibly, but there's a bunch of other explanations that seem more likely IMO. Quite possibly this is a bug that's been fixed in the 18+ months since 8.13.3 was released. The manual contains the proper troubleshooting process that you need to follow, although it'd be simpler to upgrade to a newer version to pick up all the relevant bugfixes first.