Sudden uneven CPU utilization between data nodes in the cluster

Hello gentlemen!

We have an ES cluster of nine virtual machines: four data nodes and two client nodes (spread evenly across two physical hosts) plus three master nodes (one per physical host). It had been running smoothly in this configuration for about 8 to 9 months.

However, last night two data nodes (both on the same physical host) spontaneously doubled their CPU utilization without, it seems, any additional workload: overall cluster throughput hasn't changed, and the RPS values reported by the /nodes/stats API were distributed evenly across all four data nodes.
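For reference, the RPS and CPU figures come from the standard stats endpoints; something along these lines (the host is a placeholder for one of our client nodes, and the _cat column names differ a bit between ES versions):

```
# Per-node CPU, process and thread pool stats (search active/queue/rejected)
curl -s 'http://localhost:9200/_nodes/stats/os,process,thread_pool?pretty'

# Quick tabular view of thread pool activity across the nodes
curl -s 'http://localhost:9200/_cat/thread_pool?v'
```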

At the time of this event the cluster was in search-only mode, with no data-modifying operations running, and if I'm interpreting the hot_threads output correctly, search operations are indeed the problem:

35.8% (178.8ms out of 500ms) cpu usage by thread 'elasticsearch[mk_es_01-data_011][search][T#14]'
 10/10 snapshots sharing following 10 elements
   sun.misc.Unsafe.park(Native Method)
   java.util.concurrent.locks.LockSupport.park(Unknown Source)
   java.util.concurrent.LinkedTransferQueue.awaitMatch(Unknown Source)
   java.util.concurrent.LinkedTransferQueue.xfer(Unknown Source)
   java.util.concurrent.LinkedTransferQueue.take(Unknown Source)
   org.elasticsearch.common.util.concurrent.SizeBlockingQueue.take(SizeBlockingQueue.java:162)
   java.util.concurrent.ThreadPoolExecutor.getTask(Unknown Source)
   java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
   java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
   java.lang.Thread.run(Unknown Source)

8.1% (40.5ms out of 500ms) cpu usage by thread 'elasticsearch[mk_es_01-data_011][management][T#5]'
 10/10 snapshots sharing following 9 elements
   sun.misc.Unsafe.park(Native Method)
   java.util.concurrent.locks.LockSupport.parkNanos(Unknown Source)
   java.util.concurrent.LinkedTransferQueue.awaitMatch(Unknown Source)
   java.util.concurrent.LinkedTransferQueue.xfer(Unknown Source)
   java.util.concurrent.LinkedTransferQueue.poll(Unknown Source)
   java.util.concurrent.ThreadPoolExecutor.getTask(Unknown Source)
   java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
   java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
   java.lang.Thread.run(Unknown Source)

7.5% (37.3ms out of 500ms) cpu usage by thread 'elasticsearch[mk_es_01-data_011][search][T#5]'
 10/10 snapshots sharing following 10 elements
   sun.misc.Unsafe.park(Native Method)
   java.util.concurrent.locks.LockSupport.park(Unknown Source)
   java.util.concurrent.LinkedTransferQueue.awaitMatch(Unknown Source)
   java.util.concurrent.LinkedTransferQueue.xfer(Unknown Source)
   java.util.concurrent.LinkedTransferQueue.take(Unknown Source)
   org.elasticsearch.common.util.concurrent.SizeBlockingQueue.take(SizeBlockingQueue.java:162)
   java.util.concurrent.ThreadPoolExecutor.getTask(Unknown Source)
   java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
   java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
   java.lang.Thread.run(Unknown Source)

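For completeness, the output above is from the stock hot_threads endpoint; these are the variations I can sample next to separate genuinely busy threads from parked ones (type, interval and threads are documented parameters; the second node name below is just a placeholder):

```
# Default sampling: top CPU-consuming threads per node, 500ms interval
curl -s 'http://localhost:9200/_nodes/hot_threads?threads=5'

# Sample wait/block time instead of CPU time
curl -s 'http://localhost:9200/_nodes/hot_threads?type=wait'
curl -s 'http://localhost:9200/_nodes/hot_threads?type=block'

# Restrict to the affected data nodes (second node name is a placeholder)
curl -s 'http://localhost:9200/_nodes/mk_es_01-data_011,mk_es_01-data_012/hot_threads?interval=1s'
```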
So, if you have any insight into what my next steps in diagnosing this problem should be, or if you've encountered something similar yourself, please comment below :slight_smile:

Thanks in advance.

As with any troubleshooting:

  • What specs are your nodes? (the _cat/nodes example below covers the basics)
  • What version are you on?
  • How are you monitoring things?
  • Have you checked your logs?
  • What's changed?
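
For the first two questions, something like this shows version, roles and heap per node; column names differ slightly across ES versions, so a plain _cat/nodes?v works too (the host is a placeholder):

```
curl -s 'http://localhost:9200/_cat/nodes?v&h=host,name,version,node.role,heap.max,ram.max'
```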