Spontaneously uneven CPU utilization between data nodes in the cluster

Radiosterne · August 23, 2015, 9:25am

Hello gentlemen!

We have an ES cluster consisting of nine virtual machines: four data nodes, two client nodes (allocated evenly across two physical hosts) and three master nodes (one per physical host). It has been running smoothly in this configuration for about 8 to 9 months.

However, last night two data nodes (both on the same physical host) spontaneously doubled their CPU utilization, apparently without any additional workload: overall cluster throughput hasn't changed, and RPS values according to the /nodes/stats API were distributed evenly among all four data nodes.

At the time of this event, the cluster was in search-only mode, without any data-modifying operations, and if I'm interpreting the hot_threads output correctly, search operations are indeed the problem:

35.8% (178.8ms out of 500ms) cpu usage by thread 'elasticsearch[mk_es_01-data_011][search][T#14]'
 10/10 snapshots sharing following 10 elements
   sun.misc.Unsafe.park(Native Method)
   java.util.concurrent.locks.LockSupport.park(Unknown Source)
   java.util.concurrent.LinkedTransferQueue.awaitMatch(Unknown Source)
   java.util.concurrent.LinkedTransferQueue.xfer(Unknown Source)
   java.util.concurrent.LinkedTransferQueue.take(Unknown Source)
   org.elasticsearch.common.util.concurrent.SizeBlockingQueue.take(SizeBlockingQueue.java:162)
   java.util.concurrent.ThreadPoolExecutor.getTask(Unknown Source)
   java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
   java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
   java.lang.Thread.run(Unknown Source)

8.1% (40.5ms out of 500ms) cpu usage by thread 'elasticsearch[mk_es_01-data_011][management][T#5]'
 10/10 snapshots sharing following 9 elements
   sun.misc.Unsafe.park(Native Method)
   java.util.concurrent.locks.LockSupport.parkNanos(Unknown Source)
   java.util.concurrent.LinkedTransferQueue.awaitMatch(Unknown Source)
   java.util.concurrent.LinkedTransferQueue.xfer(Unknown Source)
   java.util.concurrent.LinkedTransferQueue.poll(Unknown Source)
   java.util.concurrent.ThreadPoolExecutor.getTask(Unknown Source)
   java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
   java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
   java.lang.Thread.run(Unknown Source)

7.5% (37.3ms out of 500ms) cpu usage by thread 'elasticsearch[mk_es_01-data_011][search][T#5]'
 10/10 snapshots sharing following 10 elements
   sun.misc.Unsafe.park(Native Method)
   java.util.concurrent.locks.LockSupport.park(Unknown Source)
   java.util.concurrent.LinkedTransferQueue.awaitMatch(Unknown Source)
   java.util.concurrent.LinkedTransferQueue.xfer(Unknown Source)
   java.util.concurrent.LinkedTransferQueue.take(Unknown Source)
   org.elasticsearch.common.util.concurrent.SizeBlockingQueue.take(SizeBlockingQueue.java:162)
   java.util.concurrent.ThreadPoolExecutor.getTask(Unknown Source)
   java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
   java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
   java.lang.Thread.run(Unknown Source)
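
To compare nodes at a glance, the header lines of output like the above can be totalled per thread pool. This is only a minimal sketch: the function name `cpu_by_pool` is hypothetical, and it assumes the header format shown here (`NN% (...) cpu usage by thread 'elasticsearch[node][pool][T#n]'`), which may differ in other versions.

```python
import re
from collections import defaultdict

# Matches hot_threads header lines such as:
#   35.8% (178.8ms out of 500ms) cpu usage by thread 'elasticsearch[node][search][T#14]'
# Assumption: the first bracket is the node name and the second is the pool name.
HEADER = re.compile(
    r"(?P<pct>\d+(?:\.\d+)?)% \(.*?\) cpu usage by thread "
    r"'elasticsearch\[(?P<node>[^\]]+)\]\[(?P<pool>[^\]]+)\]"
)

def cpu_by_pool(hot_threads_text):
    """Sum the reported CPU percentages per (node, thread pool)."""
    totals = defaultdict(float)
    for line in hot_threads_text.splitlines():
        match = HEADER.search(line)
        if match:  # stack frames and blank lines do not match and are skipped
            key = (match.group("node"), match.group("pool"))
            totals[key] += float(match.group("pct"))
    return dict(totals)
```

Running this over the hot_threads text of a "hot" node and a "normal" node shows which pool accounts for the difference. The totals only say where CPU time is attributed; in the traces above every sampled thread is parked in `ThreadPoolExecutor.getTask`, that is, waiting for its next task.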

So, if you have any insight into what my next steps in diagnosing this problem should be, or have encountered something similar yourself, please comment below.

Thanks in advance.

warkolm · August 23, 2015, 10:13pm

As with any troubleshooting:

What specs are your nodes?
What version are you on?
How are you monitoring things?
Have you checked your logs?
What's changed?
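
On the monitoring question, one rough way to check whether search load really is even across data nodes is to diff two nodes-stats snapshots taken some time apart. The sketch below is an illustration only: `search_deltas` is a hypothetical helper, it takes the already-parsed JSON of two nodes-stats responses, and it assumes each node entry carries `name`, `indices.search.query_total` and `indices.search.query_time_in_millis`, which should be verified against the version in use.

```python
def search_deltas(before, after):
    """Per-node query count and mean query time between two nodes-stats snapshots.

    `before` and `after` are dicts parsed from two nodes-stats responses
    (assumed shape: {"nodes": {node_id: {"name": ..., "indices": {"search": {...}}}}}).
    """
    result = {}
    for node_id, node in after["nodes"].items():
        previous = before["nodes"].get(node_id)
        if previous is None:
            continue  # node joined between the two snapshots; nothing to diff
        s0 = previous["indices"]["search"]
        s1 = node["indices"]["search"]
        queries = s1["query_total"] - s0["query_total"]
        millis = s1["query_time_in_millis"] - s0["query_time_in_millis"]
        result[node["name"]] = {
            "queries": queries,
            # Avoid dividing by zero when a node served no queries in the interval.
            "avg_query_ms": millis / queries if queries else 0.0,
        }
    return result
```

If the query counts come out similar on all four data nodes while the average query time is higher only on the two busy ones, that would suggest looking at the shared physical host rather than at how requests are distributed.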
