Transport threads starved of CPU for long periods due to node-level scheduling/resource contention in an Elasticsearch cluster
Environment:
Elasticsearch version: v8.13.3
Deploy: tar.gz
Service management: systemd
The physical host is a dual-socket NUMA machine with 80 logical CPU cores and 380 GB of memory. It hosts two hot data nodes, one warm data node, and three Logstash instances; resource constraints leave no alternative to this co-located deployment. Based on per-process CPU monitoring of the nodes, the following allocation scheme was adopted:
1. CPU Allocation Strategy
A systemd isolation configuration is used to maximize performance by exploiting the NUMA architecture: each JVM's CPUs and memory are confined to the same NUMA node as far as possible, to reduce cross-node memory access latency.
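As a rough illustration, this is the kind of drop-in I mean, assuming a unit named elasticsearch-hot1.service and that NUMA node 0 owns logical cores 0-39 (the unit name, core range and memory cap are illustrative rather than the exact production values, and NUMAPolicy/NUMAMask require systemd >= 243):

```ini
# /etc/systemd/system/elasticsearch-hot1.service.d/numa.conf
[Service]
# Pin the JVM's threads to the logical cores of NUMA node 0.
CPUAffinity=0-39
# Prefer memory allocations from the same NUMA node.
NUMAPolicy=bind
NUMAMask=0
# Cap memory so the co-located processes cannot starve each other.
MemoryMax=120G
```

During the incident the elected master logged the following: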
[WARN ][o.e.c.InternalClusterInfoService] [xxxx] failed to retrieve shard stats from node [3BVEF_zmSceXcMw7Uv3Qyg] org.elasticsearch.transport.ReceiveTimeoutTransportException: [xxxx][xxxx][indices:monitor/stats[n]] request_id [1820674096] timed out after [15006ms]
[WARN ][o.e.c.InternalClusterInfoService] [xxxx] failed to retrieve shard stats from node [MVfdVFkoQqeimCtvkvgsjQ] org.elasticsearch.transport.ReceiveTimeoutTransportException: [xxxx][xxxx][indices:monitor/stats[n]] request_id [1820674098] timed out after [15006ms]
[ERROR][o.e.x.m.c.c.ClusterStatsCollector] [xxxx] collector [cluster_stats] timed out when collecting data: nodes [3BVEF_zmSceXcMw7Uv3Qyg, MVfdVFkoQqeimCtvkvgsjQ, aonbQRHEQxW4deheX-pWww] did not respond within [10s]
[INFO ][o.e.c.r.a.AllocationService] [xxxx] current.health="RED" message="Cluster health status changed from [GREEN] to [RED] (reason: [{xxxxxx}{MVfdVFkoQqeimCtvkvgsjQ}{cJGvljffQ-mQqFxBeOy7Zw}{xxxxxx}{xxxxxx}{xxxxxx}{hs}{8.13.3}{7000099-8503000} reason: followers check retry count exceeded [timeouts=3, failures=0]])." previous.health="GREEN" reason="{xxxxxx}{MVfdVFkoQqeimCtvkvgsjQ}{cJGvljffQ-mQqFxBeOy7Zw}{xxxxxx}{10.109.97.51}{xxxxxx}{hs}{8.13.3}{7000099-8503000} reason: followers check retry count exceeded [timeouts=3, failures=0]"
At the same time, the overall CPU utilization of the physical machine was about 28.8%, the 5-minute load average was about 63.58, and per-core usage sat between 10% and 50%. thread_pool.write.rejected and thread_pool.write.queue increased by about 10,000. Stack Monitoring shows the node's JVM was normal with no full GC, the network between the cluster nodes was fine, and the data disks are high-performance NVMe drives whose basic monitoring showed normal disk I/O.
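The checks above can be reproduced with the standard APIs; roughly the calls I mean (localhost:9200 and the plain, unauthenticated curl form are placeholders for the real endpoint and credentials):

```bash
# Per-node write thread pool pressure (active / queued / rejected tasks)
curl -s 'localhost:9200/_cat/thread_pool/write?v&h=node_name,active,queue,rejected'

# What the busiest threads on the affected node were doing at the time
curl -s 'localhost:9200/_nodes/<node-name>/hot_threads?threads=5'

# Cluster-level view of the health transition
curl -s 'localhost:9200/_cluster/health?pretty'
```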
About two minutes later the anomalous node rejoined the cluster; two minutes after that the cluster state changed to YELLOW, and after 16 minutes it changed to GREEN.
This was at peak logging time, so my working theory is: the multiple Elasticsearch processes on the physical machine are not isolated from each other --> a large volume of logs is written to the hot data nodes --> CPU scheduling delay --> the ES transport threads are starved --> the follower check retry count is exceeded --> the cluster state changes to RED. A quick way to check the scheduling-delay step is sketched below.
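One way to confirm or rule out the run-queue delay would be to sample the kernel's per-thread scheduling statistics for the hot node's JVM while the problem is happening; the second field of each schedstat file is cumulative time spent waiting on the run queue, in nanoseconds (the unit name used to look up the PID is hypothetical):

```bash
# Hypothetical unit name; substitute the real one.
PID=$(systemctl show -p MainPID --value elasticsearch-hot1.service)

# schedstat fields per thread: time_on_cpu_ns  time_waiting_on_runqueue_ns  nr_timeslices
# Take two samples 10 s apart; a large delta in the second field means the
# JVM's threads are runnable but not getting scheduled.
awk '{wait += $2} END {print wait, "ns waiting on runqueue"}' /proc/$PID/task/*/schedstat
sleep 10
awk '{wait += $2} END {print wait, "ns waiting on runqueue"}' /proc/$PID/task/*/schedstat
```

If the run-queue wait time grows much faster than actual CPU time during the spike, that points at host-level contention rather than anything inside Elasticsearch itself.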
Possibly, but there's a bunch of other explanations that seem more likely IMO. Quite possibly this is a bug that's been fixed in the 18+ months since 8.13.3 was released. The manual contains the proper troubleshooting process that you need to follow, although it'd be simpler to upgrade to a newer version to pick up all the relevant bugfixes first.