One ES data node's CPU suddenly jumps to 90%+ in production

Hi guys, we are running an ES 7.1.1 cluster in production. Everything looked fine until one day a data node's CPU suddenly jumped to 90%+ and all queries against the index on that node timed out. When we restarted the ES process on that machine, everything went back to normal and we could not reproduce the problem. We have hit this problem twice on the same cluster, and we have no idea when it will happen again. Could anyone tell me how to debug this?

Here is some information about the ES cluster:
ES version: 7.1.1
Nodes: 3 master nodes, 34 data nodes with 30g JVM, CMS GC algorithm; all of the machines have SSDs.

[Image: flame graph from that node while the CPU was high]


It does not seem to be related to any particular query. While this was happening, any query could push the CPU to 90%+, and after we restarted the ES process, we replayed the same queries and they did not trigger the problem again.

Thanks for any help!

What do hot threads and slow log show for that node?
Do you have Monitoring enabled?
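
For anyone who wants to capture this the next time it happens, here is a minimal sketch of the two requests involved: the nodes hot threads API and the search slow log thresholds. The cluster URL `http://localhost:9200`, the index name `my-index` and the threshold values are assumptions for illustration only, not values from this cluster. The functions only build the requests; sending them is left to the caller.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request

# Assumption: the cluster is reachable here without authentication.
ES_URL = "http://localhost:9200"


def hot_threads_request(node, threads=3, interval="500ms"):
    """Build a GET request for the hot threads of a single node."""
    query = urlencode({"threads": threads, "interval": interval})
    url = f"{ES_URL}/_nodes/{quote(node)}/hot_threads?{query}"
    return Request(url, method="GET")


def slowlog_settings_request(index, warn="5s", info="1s"):
    """Build a PUT request that sets query slow log thresholds on an index.

    The threshold values are illustrative; pick ones that fit your latencies.
    """
    body = {
        "index.search.slowlog.threshold.query.warn": warn,
        "index.search.slowlog.threshold.query.info": info,
    }
    return Request(
        f"{ES_URL}/{quote(index)}/_settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
```

To actually send one of them, something like `urllib.request.urlopen(hot_threads_request("node-1")).read().decode()` should work against an unsecured cluster. Saving a few hot threads dumps a minute or so apart while the CPU is high makes it easier to tell a single heavy query from a node that is stuck.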

Hello @warkolm, thanks for your reply. I didn't record the hot threads or the slow log at the time. I will provide them if it happens again.
Do you mean "xpack.monitoring.collection.enabled"? If so, no, it is not enabled. Do I need to turn it on, and which metrics should I focus on?

Head into this particular node's monitoring page and see if there is anything else that correlates with the CPU spike.


Hi @warkolm, the problem happened again. Here is a screenshot of that node's monitoring page.
And here is part of the hot_threads output for that node:
{xpack.installed=true}
Hot threads at 2021-04-08T05:20:44.153Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

87.9% (439.7ms out of 500ms) cpu usage by thread 'elasticsearch[node-1][search][T#13]'
2/10 snapshots sharing following 37 elements
app//org.apache.lucene.util.bkd.BKDReader.addAll(BKDReader.java:390)
app//org.apache.lucene.util.bkd.BKDReader.addAll(BKDReader.java:394)
app//org.apache.lucene.util.bkd.BKDReader.addAll(BKDReader.java:390)
app//org.apache.lucene.util.bkd.BKDReader.intersect(BKDReader.java:565)
app//org.apache.lucene.util.bkd.BKDReader.intersect(BKDReader.java:610)
app//org.apache.lucene.util.bkd.BKDReader.intersect(BKDReader.java:600)
app//org.apache.lucene.util.bkd.BKDReader.intersect(BKDReader.java:600)
app//org.apache.lucene.util.bkd.BKDReader.intersect(BKDReader.java:610)
app//org.apache.lucene.util.bkd.BKDReader.intersect(BKDReader.java:361)
app//org.apache.lucene.search.PointInSetQuery$1.scorer(PointInSetQuery.java:138)
app//org.apache.lucene.search.Weight.scorerSupplier(Weight.java:143)
app//org.apache.lucene.search.LRUQueryCache$CachingWrapperWeight.scorerSupplier(LRUQueryCache.java:727)
app//org.elasticsearch.indices.IndicesQueryCache$CachingWeightWrapper.scorerSupplier(IndicesQueryCache.java:157)
app//org.apache.lucene.search.BooleanWeight.scorerSupplier(BooleanWeight.java:374)
app//org.apache.lucene.search.BooleanWeight.scorer(BooleanWeight.java:340)
app//org.apache.lucene.search.Weight.bulkScorer(Weight.java:177)
app//org.apache.lucene.search.BooleanWeight.bulkScorer(BooleanWeight.java:334)
app//org.apache.lucene.search.LRUQueryCache$CachingWrapperWeight.bulkScorer(LRUQueryCache.java:808)
app//org.elasticsearch.indices.IndicesQueryCache$CachingWeightWrapper.bulkScorer(IndicesQueryCache.java:163)
app//org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:649)
app//org.elasticsearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:177)
app//org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:443)
app//org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:275)
app//org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:115)
app//org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:349)
app//org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:393)
app//org.elasticsearch.search.SearchService.access$100(SearchService.java:124)
app//org.elasticsearch.search.SearchService$2.onResponse(SearchService.java:358)
app//org.elasticsearch.search.SearchService$2.onResponse(SearchService.java:354)
app//org.elasticsearch.search.SearchService$4.doRun(SearchService.java:1069)
app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
app//org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:41)
app//org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751)
app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
java.base@12.0.1/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
java.base@12.0.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
java.base@12.0.1/java.lang.Thread.run(Thread.java:835)
8/10 snapshots sharing following 34 elements
app//org.apache.lucene.util.bkd.BKDReader.intersect(BKDReader.java:565)
app//org.apache.lucene.util.bkd.BKDReader.intersect(BKDReader.java:610)
app//org.apache.lucene.util.bkd.BKDReader.intersect(BKDReader.java:600)
app//org.apache.lucene.util.bkd.BKDReader.intersect(BKDReader.java:600)
app//org.apache.lucene.util.bkd.BKDReader.intersect(BKDReader.java:610)
app//org.apache.lucene.util.bkd.BKDReader.intersect(BKDReader.java:361)
app//org.apache.lucene.search.PointInSetQuery$1.scorer(PointInSetQuery.java:138)
app//org.apache.lucene.search.Weight.scorerSupplier(Weight.java:143)
app//org.apache.lucene.search.LRUQueryCache$CachingWrapperWeight.scorerSupplier(LRUQueryCache.java:727)
app//org.elasticsearch.indices.IndicesQueryCache$CachingWeightWrapper.scorerSupplier(IndicesQueryCache.java:157)
app//org.apache.lucene.search.BooleanWeight.scorerSupplier(BooleanWeight.java:374)
app//org.apache.lucene.search.BooleanWeight.scorer(BooleanWeight.java:340)
app//org.apache.lucene.search.Weight.bulkScorer(Weight.java:177)
app//org.apache.lucene.search.BooleanWeight.bulkScorer(BooleanWeight.java:334)
app//org.apache.lucene.search.LRUQueryCache$CachingWrapperWeight.bulkScorer(LRUQueryCache.java:808)
app//org.elasticsearch.indices.IndicesQueryCache$CachingWeightWrapper.bulkScorer(IndicesQueryCache.java:163)
app//org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:649)
app//org.elasticsearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:177)
app//org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:443)
app//org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:275)
app//org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:115)
app//org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:349)
app//org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:393)
app//org.elasticsearch.search.SearchService.access$100(SearchService.java:124)
app//org.elasticsearch.search.SearchService$2.onResponse(SearchService.java:358)
app//org.elasticsearch.search.SearchService$2.onResponse(SearchService.java:354)
app//org.elasticsearch.search.SearchService$4.doRun(SearchService.java:1069)
app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
app//org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:41)
app//org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751)
app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
java.base@12.0.1/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
java.base@12.0.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
java.base@12.0.1/java.lang.Thread.run(Thread.java:835)

76.7% (383.7ms out of 500ms) cpu usage by thread 'elasticsearch[node-1][search][T#10]'
2/10 snapshots sharing following 42 elements
app//org.apache.lucene.util.bkd.DocIdsWriter.readInts(DocIdsWriter.java:124)
app//org.apache.lucene.util.bkd.BKDReader.visitDocIDs(BKDReader.java:428)
app//org.apache.lucene.util.bkd.BKDReader.addAll(BKDReader.java:385)
app//org.apache.lucene.util.bkd.BKDReader.addAll(BKDReader.java:394)
app//org.apache.lucene.util.bkd.BKDReader.addAll(BKDReader.java:394)
app//org.apache.lucene.util.bkd.BKDReader.addAll(BKDReader.java:394)
app//org.apache.lucene.util.bkd.BKDReader.addAll(BKDReader.java:390)
app//org.apache.lucene.util.bkd.BKDReader.addAll(BKDReader.java:390)
app//org.apache.lucene.util.bkd.BKDReader.addAll(BKDReader.java:394)
app//org.apache.lucene.util.bkd.BKDReader.intersect(BKDReader.java:565)
app//org.apache.lucene.util.bkd.BKDReader.intersect(BKDReader.java:610)
app//org.apache.lucene.util.bkd.BKDReader.intersect(BKDReader.java:600)
app//org.apache.lucene.util.bkd.BKDReader.intersect(BKDReader.java:600)
app//org.apache.lucene.util.bkd.BKDReader.intersect(BKDReader.java:361)
app//org.apache.lucene.search.PointInSetQuery$1.scorer(PointInSetQuery.java:138)
app//org.apache.lucene.search.Weight.scorerSupplier(Weight.java:143)
app//org.apache.lucene.search.LRUQueryCache$CachingWrapperWeight.scorerSupplier(LRUQueryCache.java:727)
app//org.elasticsearch.indices.IndicesQueryCache$CachingWeightWrapper.scorerSupplier(IndicesQueryCache.java:157)
app//org.apache.lucene.search.BooleanWeight.scorerSupplier(BooleanWeight.java:374)
app//org.apache.lucene.search.BooleanWeight.scorer(BooleanWeight.java:340)
app//org.apache.lucene.search.Weight.bulkScorer(Weight.java:177)
app//org.apache.lucene.search.BooleanWeight.bulkScorer(BooleanWeight.java:334)
app//org.apache.lucene.search.LRUQueryCache$CachingWrapperWeight.bulkScorer(LRUQueryCache.java:808)
app//org.elasticsearch.indices.IndicesQueryCache$CachingWeightWrapper.bulkScorer(IndicesQueryCache.java:163)
app//org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:649)
app//org.elasticsearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:177)
app//org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:443)
app//org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:275)
app//org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:115)
app//org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:349)
app//org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:393)
app//org.elasticsearch.search.SearchService.access$100(SearchService.java:124)
app//org.elasticsearch.search.SearchService$2.onResponse(SearchService.java:358)
app//org.elasticsearch.search.SearchService$2.onResponse(SearchService.java:354)
app//org.elasticsearch.search.SearchService$4.doRun(SearchService.java:1069)
app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
app//org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:41)
app//org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751)
app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
java.base@12.0.1/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
java.base@12.0.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
java.base@12.0.1/java.lang.Thread.run(Thread.java:835)
2/10 snapshots sharing following 44 elements
app//org.apache.lucene.store.DataInput.readVInt(DataInput.java:125)
app//org.apache.lucene.util.bkd.DocIdsWriter.readDeltaVInts(DocIdsWriter.java:140)
app//org.apache.lucene.util.bkd.DocIdsWriter.readInts(DocIdsWriter.java:124)
app//org.apache.lucene.util.bkd.BKDReader.visitDocIDs(BKDReader.java:428)
app//org.apache.lucene.util.bkd.BKDReader.addAll(BKDReader.java:385)
app//org.apache.lucene.util.bkd.BKDReader.addAll(BKDReader.java:390)
app//org.apache.lucene.util.bkd.BKDReader.addAll(BKDReader.java:394)
app//org.apache.lucene.util.bkd.BKDReader.addAll(BKDReader.java:390)
app//org.apache.lucene.util.bkd.BKDReader.addAll(BKDReader.java:394)
app//org.apache.lucene.util.bkd.BKDReader.addAll(BKDReader.java:390)
app//org.apache.lucene.util.bkd.BKDReader.addAll(BKDReader.java:394)
app//org.apache.lucene.util.bkd.BKDReader.intersect(BKDReader.java:565)
app//org.apache.lucene.util.bkd.BKDReader.intersect(BKDReader.java:610)
app//org.apache.lucene.util.bkd.BKDReader.intersect(BKDReader.java:600)
app//org.apache.lucene.util.bkd.BKDReader.intersect(BKDReader.java:600)
app//org.apache.lucene.util.bkd.BKDReader.intersect(BKDReader.java:361)
app//org.apache.lucene.search.PointInSetQuery$1.scorer(PointInSetQuery.java:138)
app//org.apache.lucene.search.Weight.scorerSupplier(Weight.java:143)
app//org.apache.lucene.search.LRUQueryCache$CachingWrapperWeight.scorerSupplier(LRUQueryCache.java:727)
app//org.elasticsearch.indices.IndicesQueryCache$CachingWeightWrapper.scorerSupplier(IndicesQueryCache.java:157)
app//org.apache.lucene.search.BooleanWeight.scorerSupplier(BooleanWeight.java:374)
app//org.apache.lucene.search.BooleanWeight.scorer(BooleanWeight.java:340)
app//org.apache.lucene.search.Weight.bulkScorer(Weight.java:177)
app//org.apache.lucene.search.BooleanWeight.bulkScorer(BooleanWeight.java:334)
app//org.apache.lucene.search.LRUQueryCache$CachingWrapperWeight.bulkScorer(LRUQueryCache.java:808)

Looks like a pretty heavy search based on that.
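
To make a dump like the one above easier to compare across captures, here is a rough sketch that reduces hot threads text to one entry per thread: its CPU percentage and, for each group of snapshots, the topmost stack frame weighted by how many snapshots share it. The function name and output shape are my own choices, and the parser only assumes the line formats visible in the dump above. In this dump the top frames are all in `BKDReader` and `DocIdsWriter` under `PointInSetQuery`, which is what Lucene uses for a set-of-values query on a point (numeric) field.

```python
import re
from collections import Counter

# Matches e.g. "87.9% (439.7ms out of 500ms) cpu usage by thread '...'"
THREAD_RE = re.compile(r"^([\d.]+)% \([^)]*\) cpu usage by thread '(.+)'$")
# Matches e.g. "2/10 snapshots sharing following 37 elements"
SNAPSHOT_RE = re.compile(r"^(\d+)/(\d+) snapshots sharing following (\d+) elements$")


def summarize_hot_threads(text):
    """Return a list of dicts: thread name, CPU %, and top-frame counts."""
    summary = []
    current = None
    weight = 0
    expect_top_frame = False
    for raw in text.splitlines():
        line = raw.strip()
        thread = THREAD_RE.match(line)
        if thread:
            current = {
                "thread": thread.group(2),
                "cpu_percent": float(thread.group(1)),
                "top_frames": Counter(),
            }
            summary.append(current)
            expect_top_frame = False
            continue
        snapshot = SNAPSHOT_RE.match(line)
        if snapshot and current is not None:
            weight = int(snapshot.group(1))
            expect_top_frame = True
            continue
        if expect_top_frame and "(" in line:
            # Drop the "(File.java:123)" suffix and the "app//" or
            # "java.base@.../" module prefix, keeping "package.Class.method".
            frame = line.split("(", 1)[0].rsplit("/", 1)[-1]
            current["top_frames"][frame] += weight
            expect_top_frame = False
    return summary
```

This only looks at the first frame of each snapshot group, so it is a quick triage aid rather than a profiler. If the same few frames dominate both during the incident and after a restart, the query itself is heavy; if they only show up during the incident, the difference is more likely in the node's state.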

But the problem is that when I remove or restart this node and send exactly the same queries to the cluster, everything works fine. The index and the queries are still the same.