Elasticsearch - Understanding Hot Threads

Hi, we are running a 9-node cluster on which I am seeing very high CPU utilisation (nodes.process.cpu.percent). This metric sometimes reaches 200%. I am not sure I understand the underlying cause well enough to fix it.

After a little searching, I found that we need to check the hot threads to see what the nodes are actually doing. I am sharing that output as well, in case it is helpful.

This is what I see when I run the hot threads API.
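
For reference, I am calling the API roughly like this (a sketch; host and port are placeholders, and the parameters are just the API defaults shown in the output header):

    # Sample the 3 busiest threads on each node over a 500ms window,
    # ignoring idle threads (these are the hot threads API defaults)
    curl -s 'http://localhost:9200/_nodes/hot_threads?threads=3&interval=500ms&ignore_idle_threads=true'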

::: {i3-node-06}{EmNojtlbR5mZ1gO-StswwQ}{t6s5Cl0sSJ2A4KzA6wFHjA}
   Hot threads at 2018-09-27T06:52:56.239Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
   
   21.3% (106.2ms out of 500ms) cpu usage by thread 'elasticsearch[i3-node-06][management][T#3]'
     2/10 snapshots sharing following 31 elements
       java.lang.SecurityManager.checkRead(SecurityManager.java:888)
       sun.nio.fs.UnixPath.checkRead(UnixPath.java:795)
       sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:49)
       sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:144)
       sun.nio.fs.LinuxFileSystemProvider.readAttributes(LinuxFileSystemProvider.java:99)
       java.nio.file.Files.readAttributes(Files.java:1737)
       java.nio.file.Files.size(Files.java:2332)
       org.apache.lucene.store.FSDirectory.fileLength(FSDirectory.java:243)
       org.apache.lucene.store.FilterDirectory.fileLength(FilterDirectory.java:67)
       org.apache.lucene.store.FilterDirectory.fileLength(FilterDirectory.java:67)
       org.elasticsearch.index.store.Store$StoreStatsCache.estimateSize(Store.java:1370)
       org.elasticsearch.index.store.Store$StoreStatsCache.refresh(Store.java:1359)
       org.elasticsearch.index.store.Store$StoreStatsCache.refresh(Store.java:1346)
       org.elasticsearch.common.util.SingleObjectCache.getOrRefresh(SingleObjectCache.java:54)
       org.elasticsearch.index.store.Store.stats(Store.java:300)
       org.elasticsearch.index.shard.IndexShard.storeStats(IndexShard.java:702)
       org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:177)
       org.elasticsearch.indices.IndicesService.stats(IndicesService.java:311)
       org.elasticsearch.node.NodeService.stats(NodeService.java:105)
       org.elasticsearch.action.admin.cluster.node.stats.TransportNodesStatsAction.nodeOperation(TransportNodesStatsAction.java:77)
       org.elasticsearch.action.admin.cluster.node.stats.TransportNodesStatsAction.nodeOperation(TransportNodesStatsAction.java:42)
       org.elasticsearch.action.support.nodes.TransportNodesAction.nodeOperation(TransportNodesAction.java:145)
       org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:269)
        org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:265)
        org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69)
        org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1488)
        org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638)
       org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       java.lang.Thread.run(Thread.java:748)
     8/10 snapshots sharing following 2 elements
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       java.lang.Thread.run(Thread.java:748)
   
   19.0% (95.1ms out of 500ms) cpu usage by thread 'elasticsearch[i3-node-06][search][T#3]'
     2/10 snapshots sharing following 23 elements
        org.apache.lucene.util.bkd.BKDReader.intersect(BKDReader.java:487)
        org.apache.lucene.codecs.lucene60.Lucene60PointsReader.intersect(Lucene60PointsReader.java:142)
       org.apache.lucene.search.PointInSetQuery$1.scorer(PointInSetQuery.java:141)
       org.apache.lucene.search.LRUQueryCache$CachingWrapperWeight.scorer(LRUQueryCache.java:726)
       org.elasticsearch.indices.IndicesQueryCache$CachingWeightWrapper.scorer(IndicesQueryCache.java:155)
       org.apache.lucene.search.BooleanWeight.scorer(BooleanWeight.java:389)
       org.apache.lucene.search.Weight.bulkScorer(Weight.java:135)
       org.apache.lucene.search.BooleanWeight.bulkScorer(BooleanWeight.java:370)
       org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:666)
       org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:473)
       org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:397)
       org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:108)
       org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:247)
       org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:261)
       org.elasticsearch.action.search.SearchTransportService$6.messageReceived(SearchTransportService.java:331)
       org.elasticsearch.action.search.SearchTransportService$6.messageReceived(SearchTransportService.java:328)
       org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69)
       org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1488)
       org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638)
       org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       java.lang.Thread.run(Thread.java:748)
     8/10 snapshots sharing following 10 elements
       sun.misc.Unsafe.park(Native Method)
       java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
       java.util.concurrent.LinkedTransferQueue.awaitMatch(LinkedTransferQueue.java:737)
       java.util.concurrent.LinkedTransferQueue.xfer(LinkedTransferQueue.java:647)
       java.util.concurrent.LinkedTransferQueue.take(LinkedTransferQueue.java:1269)
       org.elasticsearch.common.util.concurrent.SizeBlockingQueue.take(SizeBlockingQueue.java:161)
       java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       java.lang.Thread.run(Thread.java:748)

Can you help me understand how exactly to read the above output? Where do I look for the gaps? Also, the individual CPU usage of every node averages around 60% over a 1-minute interval.
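
For what it is worth, the per-node CPU figures can also be cross-checked against what Elasticsearch itself reports via the cat nodes API (host and port are placeholders):

    # Per-node CPU percentage and 1-minute load average as reported by Elasticsearch
    curl -s 'http://localhost:9200/_cat/nodes?v&h=name,cpu,load_1m'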

Please help me out if possible. Also, if anything else is needed, please let me know.

I have trimmed the hot threads API output since the description supports only 7000 characters. Thank you!

Please view the complete result here:

http://s000.tinyupload.com/index.php?file_id=05121898809573789003

It looks like CPU is taken up by a mix of searches and management.
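
One quick way to see which thread pools are busy on each node is the cat thread_pool API, for example (host and port are placeholders):

    # Active, queued and rejected tasks per thread pool on every node
    curl -s 'http://localhost:9200/_cat/thread_pool?v&h=node_name,name,active,queue,rejected'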

Which version of Elasticsearch are you using?

How many indices and shards do you have in the cluster?

Are you frequently updating mappings, e.g. through dynamic mappings, or creating/deleting indices in the cluster? Are you polling it for statistics?
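
If it helps, the index and shard counts can be pulled from the cat APIs, e.g. (host and port are placeholders):

    # Cluster-wide shard counts
    curl -s 'http://localhost:9200/_cat/health?v'
    # One line per index with shard counts, doc counts and store sizes
    curl -s 'http://localhost:9200/_cat/indices?v'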

Hi Christian,

Thank you for your response.

It looks like CPU is taken up by a mix of searches and management. - Yes, the cluster is search-intensive.

Which version of Elasticsearch are you using? - We are using Elasticsearch 5.3.2.

How many indices and shards do you have in the cluster?
Total Indices : 53
Active Shards : 340
Primary Shards : 170

Most of the data is metrics data that is rolled over daily and used only intermittently.

There are two indices in the cluster which are search-intensive. Their details are as follows:

Index 1 :
Primary Shards : 7
Replica Count : 1
Total Doc Count : 21602025
Total Doc Deleted : 4365425
Total Store Size : 97.2gb
Primary Store Size : 48.8gb

Index 2 :
Primary Shards : 3
Replica Count : 1
Total Doc Count : 7404449
Total Doc Deleted : 433035
Total Store Size : 9.3gb
Primary Store Size : 4.6gb

These two indices are frequently used for searches.
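
For reference, these figures can be reproduced with the cat indices API, along these lines (the index names here are placeholders, not our actual names):

    # Shard counts, doc counts and store sizes for the two search-heavy indices
    curl -s 'http://localhost:9200/_cat/indices/index-1,index-2?v&h=index,pri,rep,docs.count,docs.deleted,store.size,pri.store.size'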

Are you frequently updating mappings, e.g. through dynamic mappings, or creating/deleting indices in the cluster? Are you polling it for statistics?

We are not updating the mappings frequently. However, there is one index which is rolled over every night, but that is generally for metrics and queried only intermittently.
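
If it is relevant, the nightly roll-over follows the standard rollover pattern; done via the rollover API, the call would be roughly of this shape (the alias name and condition are placeholders, not our exact setup):

    # Hypothetical nightly rollover of a metrics write alias
    curl -s -XPOST 'http://localhost:9200/metrics-write/_rollover' \
      -H 'Content-Type: application/json' \
      -d '{ "conditions": { "max_age": "1d" } }'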

Hey Christian, did you get a chance to look at the details shared? It would be great if you could help me out.

Also, it would help if you could explain the metric nodes.process.cpu.percent.
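
I believe the raw value corresponds to process.cpu.percent in the nodes stats API, which I can query directly (host and port are placeholders), but I am not sure how to interpret values above 100%:

    # Per-node process-level stats; the response includes process.cpu.percent
    curl -s 'http://localhost:9200/_nodes/stats/process?pretty'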

Thanks in advance!
