Hi, We are running a 9 Node Cluster on which I am seeing very high CPU Utilisation. (nodes.process.cpu.percent). This metric sometimes reaches 200%. I am not sure If I am able to understand the underlying cause to fix it.
When doing a little search, I found that we need to check the hot threads and see what is being done. Sharing that as well, If that is going to be helpful
I can see this when I run the hot threads
::: {i3-node-06}{EmNojtlbR5mZ1gO-StswwQ}{t6s5Cl0sSJ2A4KzA6wFHjA}
Hot threads at 2018-09-27T06:52:56.239Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
21.3% (106.2ms out of 500ms) cpu usage by thread 'elasticsearch[i3-node-06][management][T#3]'
2/10 snapshots sharing following 31 elements
java.lang.SecurityManager.checkRead(SecurityManager.java:888)
sun.nio.fs.UnixPath.checkRead(UnixPath.java:795)
sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:49)
sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:144)
sun.nio.fs.LinuxFileSystemProvider.readAttributes(LinuxFileSystemProvider.java:99)
java.nio.file.Files.readAttributes(Files.java:1737)
java.nio.file.Files.size(Files.java:2332)
org.apache.lucene.store.FSDirectory.fileLength(FSDirectory.java:243)
org.apache.lucene.store.FilterDirectory.fileLength(FilterDirectory.java:67)
org.apache.lucene.store.FilterDirectory.fileLength(FilterDirectory.java:67)
org.elasticsearch.index.store.Store$StoreStatsCache.estimateSize(Store.java:1370)
org.elasticsearch.index.store.Store$StoreStatsCache.refresh(Store.java:1359)
org.elasticsearch.index.store.Store$StoreStatsCache.refresh(Store.java:1346)
org.elasticsearch.common.util.SingleObjectCache.getOrRefresh(SingleObjectCache.java:54)
org.elasticsearch.index.store.Store.stats(Store.java:300)
org.elasticsearch.index.shard.IndexShard.storeStats(IndexShard.java:702)
org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:177)
org.elasticsearch.indices.IndicesService.stats(IndicesService.java:311)
org.elasticsearch.node.NodeService.stats(NodeService.java:105)
org.elasticsearch.action.admin.cluster.node.stats.TransportNodesStatsAction.nodeOperation(TransportNodesStatsAction.java:77)
org.elasticsearch.action.admin.cluster.node.stats.TransportNodesStatsAction.nodeOperation(TransportNodesStatsAction.java:42)
org.elasticsearch.action.support.nodes.TransportNodesAction.nodeOperation(TransportNodesAction.java:145)
org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:269)
org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:265) org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69)
org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1488)
org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638)
org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:748)
8/10 snapshots sharing following 2 elements
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:748)
19.0% (95.1ms out of 500ms) cpu usage by thread 'elasticsearch[i3-node-06][search][T#3]'
2/10 snapshots sharing following 23 elements
org.apache.lucene.util.bkd.BKDReader.intersect(BKDReader.java:487) org.apache.lucene.codecs.lucene60.Lucene60PointsReader.intersect(Lucene60PointsReader.java:142)
org.apache.lucene.search.PointInSetQuery$1.scorer(PointInSetQuery.java:141)
org.apache.lucene.search.LRUQueryCache$CachingWrapperWeight.scorer(LRUQueryCache.java:726)
org.elasticsearch.indices.IndicesQueryCache$CachingWeightWrapper.scorer(IndicesQueryCache.java:155)
org.apache.lucene.search.BooleanWeight.scorer(BooleanWeight.java:389)
org.apache.lucene.search.Weight.bulkScorer(Weight.java:135)
org.apache.lucene.search.BooleanWeight.bulkScorer(BooleanWeight.java:370)
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:666)
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:473)
org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:397)
org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:108)
org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:247)
org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:261)
org.elasticsearch.action.search.SearchTransportService$6.messageReceived(SearchTransportService.java:331)
org.elasticsearch.action.search.SearchTransportService$6.messageReceived(SearchTransportService.java:328)
org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69)
org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1488)
org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638)
org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:748)
8/10 snapshots sharing following 10 elements
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
java.util.concurrent.LinkedTransferQueue.awaitMatch(LinkedTransferQueue.java:737)
java.util.concurrent.LinkedTransferQueue.xfer(LinkedTransferQueue.java:647)
java.util.concurrent.LinkedTransferQueue.take(LinkedTransferQueue.java:1269)
org.elasticsearch.common.util.concurrent.SizeBlockingQueue.take(SizeBlockingQueue.java:161)
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:748)
Can you help me on how do I exactly read the above output. Where do I look for the gaps? Also, The individual CPU usage of every node is at an average of 60% with 1 min interval.
Please help me out if possible. Also, If there is something else needed, Please let me know.