Elasticsearch - Understanding Hot Threads

Hi, we are running a 9-node cluster on which I am seeing very high CPU utilisation (nodes.process.cpu.percent). This metric sometimes reaches 200%. I am not sure I understand the underlying cause well enough to fix it.

After a little searching, I found that we need to check the hot threads to see what the nodes are actually doing. I am sharing that output as well, in case it is helpful.

This is what I see when I run the hot threads API.
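
For reference, I am calling the API roughly like this (a sketch; host and port are placeholders, and the parameters are just the API defaults shown in the output header):

    # Sample the 3 busiest threads on each node over a 500ms window,
    # ignoring idle threads (these are the hot threads API defaults)
    curl -s 'http://localhost:9200/_nodes/hot_threads?threads=3&interval=500ms&ignore_idle_threads=true'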

::: {i3-node-06}{EmNojtlbR5mZ1gO-StswwQ}{t6s5Cl0sSJ2A4KzA6wFHjA}
   Hot threads at 2018-09-27T06:52:56.239Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
   
   21.3% (106.2ms out of 500ms) cpu usage by thread 'elasticsearch[i3-node-06][management][T#3]'
     2/10 snapshots sharing following 31 elements
       java.lang.SecurityManager.checkRead(SecurityManager.java:888)
       sun.nio.fs.UnixPath.checkRead(UnixPath.java:795)
       sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:49)
       sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:144)
       sun.nio.fs.LinuxFileSystemProvider.readAttributes(LinuxFileSystemProvider.java:99)
       java.nio.file.Files.readAttributes(Files.java:1737)
       java.nio.file.Files.size(Files.java:2332)
       org.apache.lucene.store.FSDirectory.fileLength(FSDirectory.java:243)
       org.apache.lucene.store.FilterDirectory.fileLength(FilterDirectory.java:67)
       org.apache.lucene.store.FilterDirectory.fileLength(FilterDirectory.java:67)
       org.elasticsearch.index.store.Store$StoreStatsCache.estimateSize(Store.java:1370)
       org.elasticsearch.index.store.Store$StoreStatsCache.refresh(Store.java:1359)
       org.elasticsearch.index.store.Store$StoreStatsCache.refresh(Store.java:1346)
       org.elasticsearch.common.util.SingleObjectCache.getOrRefresh(SingleObjectCache.java:54)
       org.elasticsearch.index.store.Store.stats(Store.java:300)
       org.elasticsearch.index.shard.IndexShard.storeStats(IndexShard.java:702)
       org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:177)
       org.elasticsearch.indices.IndicesService.stats(IndicesService.java:311)
       org.elasticsearch.node.NodeService.stats(NodeService.java:105)
       org.elasticsearch.action.admin.cluster.node.stats.TransportNodesStatsAction.nodeOperation(TransportNodesStatsAction.java:77)
       org.elasticsearch.action.admin.cluster.node.stats.TransportNodesStatsAction.nodeOperation(TransportNodesStatsAction.java:42)
       org.elasticsearch.action.support.nodes.TransportNodesAction.nodeOperation(TransportNodesAction.java:145)
       org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:269)
        org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:265)
        org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69)
        org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1488)
        org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638)
       org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       java.lang.Thread.run(Thread.java:748)
     8/10 snapshots sharing following 2 elements
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       java.lang.Thread.run(Thread.java:748)
   
   19.0% (95.1ms out of 500ms) cpu usage by thread 'elasticsearch[i3-node-06][search][T#3]'
     2/10 snapshots sharing following 23 elements
        org.apache.lucene.util.bkd.BKDReader.intersect(BKDReader.java:487)
        org.apache.lucene.codecs.lucene60.Lucene60PointsReader.intersect(Lucene60PointsReader.java:142)
       org.apache.lucene.search.PointInSetQuery$1.scorer(PointInSetQuery.java:141)
       org.apache.lucene.search.LRUQueryCache$CachingWrapperWeight.scorer(LRUQueryCache.java:726)
       org.elasticsearch.indices.IndicesQueryCache$CachingWeightWrapper.scorer(IndicesQueryCache.java:155)
       org.apache.lucene.search.BooleanWeight.scorer(BooleanWeight.java:389)
       org.apache.lucene.search.Weight.bulkScorer(Weight.java:135)
       org.apache.lucene.search.BooleanWeight.bulkScorer(BooleanWeight.java:370)
       org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:666)
       org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:473)
       org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:397)
       org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:108)
       org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:247)
       org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:261)
       org.elasticsearch.action.search.SearchTransportService$6.messageReceived(SearchTransportService.java:331)
       org.elasticsearch.action.search.SearchTransportService$6.messageReceived(SearchTransportService.java:328)
       org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69)
       org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1488)
       org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638)
       org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       java.lang.Thread.run(Thread.java:748)
     8/10 snapshots sharing following 10 elements
       sun.misc.Unsafe.park(Native Method)
       java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
       java.util.concurrent.LinkedTransferQueue.awaitMatch(LinkedTransferQueue.java:737)
       java.util.concurrent.LinkedTransferQueue.xfer(LinkedTransferQueue.java:647)
       java.util.concurrent.LinkedTransferQueue.take(LinkedTransferQueue.java:1269)
       org.elasticsearch.common.util.concurrent.SizeBlockingQueue.take(SizeBlockingQueue.java:161)
       java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       java.lang.Thread.run(Thread.java:748)

Can you help me understand how exactly to read the above output? Where do I look for the gaps? Also, the individual CPU usage of every node averages around 60% over a 1-minute interval.
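
For what it is worth, the per-node CPU figures can also be cross-checked against what Elasticsearch itself reports via the cat nodes API (host and port are placeholders):

    # Per-node CPU percentage and 1-minute load average as reported by Elasticsearch
    curl -s 'http://localhost:9200/_cat/nodes?v&h=name,cpu,load_1m'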

Please help me out if possible. Also, if anything else is needed, please let me know.

I have trimmed the hot threads API output since the description supports only 7000 characters. Thank you!

Please view the complete result here:

http://s000.tinyupload.com/index.php?file_id=05121898809573789003

It looks like CPU is taken up by a mix of searches and management.
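
One quick way to see which thread pools are busy on each node is the cat thread_pool API, for example (host and port are placeholders):

    # Active, queued and rejected tasks per thread pool on every node
    curl -s 'http://localhost:9200/_cat/thread_pool?v&h=node_name,name,active,queue,rejected'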

Which version of Elasticsearch are you using?

How many indices and shards do you have in the cluster?

Are you frequently updating mappings, e.g. through dynamic mappings, or creating/deleting indices in the cluster? Are you polling it for statistics?
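
If it helps, the index and shard counts can be pulled from the cat APIs, e.g. (host and port are placeholders):

    # Cluster-wide shard counts
    curl -s 'http://localhost:9200/_cat/health?v'
    # One line per index with shard counts, doc counts and store sizes
    curl -s 'http://localhost:9200/_cat/indices?v'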

Hi Christian,

Thank you for your response.

It looks like CPU is taken up by a mix of searches and management. - Yes, the cluster is search-intensive.

Which version of Elasticsearch are you using? - We are using Elasticsearch 5.3.2.

How many indices and shards do you have in the cluster?
Total Indices : 53
Active Shards : 340
Primary Shards : 170

Most of the data is metrics data that is rolled over daily and used only intermittently.

There are two indices in the cluster which are search-intensive. Their details are as follows:

Index 1 :
Primary Shards : 7
Replica Count : 1
Total Doc Count : 21602025
Total Doc Deleted : 4365425
Total Store Size : 97.2gb
Primary Store Size : 48.8gb

Index 2 :
Primary Shards : 3
Replica Count : 1
Total Doc Count : 7404449
Total Doc Deleted : 433035
Total Store Size : 9.3gb
Primary Store Size : 4.6gb

These two indices are frequently used for searches.
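
For reference, these figures can be reproduced with the cat indices API, along these lines (the index names here are placeholders, not our actual names):

    # Shard counts, doc counts and store sizes for the two search-heavy indices
    curl -s 'http://localhost:9200/_cat/indices/index-1,index-2?v&h=index,pri,rep,docs.count,docs.deleted,store.size,pri.store.size'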

Are you frequently updating mappings, e.g. through dynamic mappings, or creating/deleting indices in the cluster? Are you polling it for statistics?

We are not updating the mappings frequently. However, there is one index which is rolled over every night, but that is generally for metrics and queried only intermittently.
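
If it is relevant, the nightly roll-over follows the standard rollover pattern; done via the rollover API, the call would be roughly of this shape (the alias name and condition are placeholders, not our exact setup):

    # Hypothetical nightly rollover of a metrics write alias
    curl -s -XPOST 'http://localhost:9200/metrics-write/_rollover' \
      -H 'Content-Type: application/json' \
      -d '{ "conditions": { "max_age": "1d" } }'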

Hey Christian, did you get a chance to look at the details shared? It would be great if you could help me out.

Also, it would help if you could explain the metric nodes.process.cpu.percent.
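
I believe the raw value corresponds to process.cpu.percent in the nodes stats API, which I can query directly (host and port are placeholders), but I am not sure how to interpret values above 100%:

    # Per-node process-level stats; the response includes process.cpu.percent
    curl -s 'http://localhost:9200/_nodes/stats/process?pretty'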

Thanks in advance!
