Hello gentlemen!
We have an ES cluster consisting of nine virtual machines: four data nodes and two client nodes (allocated evenly across two physical hosts) and three master nodes (one per physical host). It had been running smoothly in this configuration for about 8 to 9 months.
However, last night two data nodes (both on the same physical host) spontaneously doubled their CPU utilization, apparently without any additional workload: overall cluster throughput hasn't changed, and RPS values according to the /nodes/stats API were distributed evenly across all four data nodes.
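For anyone who wants to run the same comparison, here is a minimal sketch of how the per-node numbers could be pulled out of a node stats response. It assumes the response has already been fetched and parsed into a dict, and that it carries the usual indices.search.query_total and indices.search.query_time_in_millis counters. The node names and values in the sample are made up for illustration and are not from our cluster. The counters are cumulative since node start, so a real check would diff two samples taken some time apart.

```python
def per_node_search_load(stats):
    """Map node name -> (query_total, average query time in ms).

    `stats` is a parsed node stats response (a dict); missing fields
    are treated as zero so a partial response does not raise.
    """
    result = {}
    for node in stats.get("nodes", {}).values():
        search = node.get("indices", {}).get("search", {})
        total = search.get("query_total", 0)
        millis = search.get("query_time_in_millis", 0)
        avg_ms = millis / total if total else 0.0
        result[node.get("name", "?")] = (total, avg_ms)
    return result


# Hypothetical sample: same query count on both nodes, different time spent.
sample_stats = {
    "nodes": {
        "id-a": {"name": "data_a", "indices": {"search": {
            "query_total": 1000, "query_time_in_millis": 50000}}},
        "id-b": {"name": "data_b", "indices": {"search": {
            "query_total": 1000, "query_time_in_millis": 20000}}},
    }
}

load = per_node_search_load(sample_stats)
for name, (total, avg_ms) in sorted(load.items()):
    print(f"{name}: {total} queries, {avg_ms:.1f} ms average")
```

Equal query counts with a clearly higher average time on some nodes would suggest those nodes are slower per query rather than receiving more traffic.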
At the time of this event the cluster was in search-only mode, without any data-modifying operations, and if I'm interpreting the hot_threads output correctly, search operations are indeed the problem:
35.8% (178.8ms out of 500ms) cpu usage by thread 'elasticsearch[mk_es_01-data_011][search][T#14]'
10/10 snapshots sharing following 10 elements
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(Unknown Source)
java.util.concurrent.LinkedTransferQueue.awaitMatch(Unknown Source)
java.util.concurrent.LinkedTransferQueue.xfer(Unknown Source)
java.util.concurrent.LinkedTransferQueue.take(Unknown Source)
org.elasticsearch.common.util.concurrent.SizeBlockingQueue.take(SizeBlockingQueue.java:162)
java.util.concurrent.ThreadPoolExecutor.getTask(Unknown Source)
java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
java.lang.Thread.run(Unknown Source)
8.1% (40.5ms out of 500ms) cpu usage by thread 'elasticsearch[mk_es_01-data_011][management][T#5]'
10/10 snapshots sharing following 9 elements
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.parkNanos(Unknown Source)
java.util.concurrent.LinkedTransferQueue.awaitMatch(Unknown Source)
java.util.concurrent.LinkedTransferQueue.xfer(Unknown Source)
java.util.concurrent.LinkedTransferQueue.poll(Unknown Source)
java.util.concurrent.ThreadPoolExecutor.getTask(Unknown Source)
java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
java.lang.Thread.run(Unknown Source)
7.5% (37.3ms out of 500ms) cpu usage by thread 'elasticsearch[mk_es_01-data_011][search][T#5]'
10/10 snapshots sharing following 10 elements
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(Unknown Source)
java.util.concurrent.LinkedTransferQueue.awaitMatch(Unknown Source)
java.util.concurrent.LinkedTransferQueue.xfer(Unknown Source)
java.util.concurrent.LinkedTransferQueue.take(Unknown Source)
org.elasticsearch.common.util.concurrent.SizeBlockingQueue.take(SizeBlockingQueue.java:162)
java.util.concurrent.ThreadPoolExecutor.getTask(Unknown Source)
java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
java.lang.Thread.run(Unknown Source)
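To make output like the above easier to compare between nodes, here is a rough sketch that sums the reported CPU percentage per thread pool. It only looks at the "cpu usage by thread" header lines and assumes they follow the exact pattern shown above (node name, then pool name, then thread number in square brackets). The function name cpu_by_pool is my own and not part of any Elasticsearch tooling.

```python
import re
from collections import defaultdict

# Matches header lines such as:
# 35.8% (178.8ms out of 500ms) cpu usage by thread 'elasticsearch[node][search][T#14]'
HEADER = re.compile(
    r"([\d.]+)% \(.*?\) cpu usage by thread "
    r"'elasticsearch\[[^\]]*\]\[([^\]]+)\]\[T#\d+\]'"
)


def cpu_by_pool(hot_threads_text):
    """Sum the reported CPU percentages per thread pool name."""
    totals = defaultdict(float)
    for match in HEADER.finditer(hot_threads_text):
        percent, pool = match.groups()
        totals[pool] += float(percent)
    return dict(totals)


# The three header lines from the output above (stack frames omitted).
sample_output = """\
35.8% (178.8ms out of 500ms) cpu usage by thread 'elasticsearch[mk_es_01-data_011][search][T#14]'
8.1% (40.5ms out of 500ms) cpu usage by thread 'elasticsearch[mk_es_01-data_011][management][T#5]'
7.5% (37.3ms out of 500ms) cpu usage by thread 'elasticsearch[mk_es_01-data_011][search][T#5]'
"""

pool_totals = cpu_by_pool(sample_output)
for pool, percent in sorted(pool_totals.items(), key=lambda kv: -kv[1]):
    print(f"{pool}: {percent:.1f}%")
```

Running the same summary against a healthy node and an affected one should show whether the extra CPU is really concentrated in the search pool.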
So, if you have any insight into what my next steps in diagnosing this problem should be, or have encountered something similar yourself, please comment below.
Thanks in advance.