Elasticsearch spends all CPU time decompressing till restart

Hey,

I have a fun issue here, that I am unable to pinpoint. We have a relatively old 7.9.0 cluster (spare me be upgrade preach, I've been doing that for months for this cluster :shushing_face:), that suddenly jumps from 15% utilization across all eight data nodes to 100% (8 cores). We reindexed some index at the same time, however the CPU utilization stayed constantly high 20 hours after that indexing was finished, and mainly with queries reading from other indices. hot threads output contains the following from search threads only (no merge threads in the output):

org.apache.lucene.codecs.blocktree.IntersectTermsEnum.next
...
org.apache.lucene.util.compress.LZ4.decompress

(I don't have the full hot threads output at the moment).

Another interesting part is the increased disk reads during that time.

Rebooting all VMs and thus restarting Elasticsearch solved this for the whole cluster.

My current working assumption: Some cache (indices, query cache) stops working, requiring disk access for more queries, requiring decompression for more queries, resulting in 100% CPU usage.

What I do not understand: Why is that happening at the same time across all nodes, if it is some cache that goes broke?

If anyone has further ideas, that I am missing or if there is a known bug, feel free to share :slight_smile:

--Alex

Hey,

didn't took a lot of time for this to happen again, so here is the hot threads output of a node

   100.1% (500.7ms out of 500ms) cpu usage by thread 'elasticsearch[NODE_NAME_GOES_HERE][search][T#8]'
     6/10 snapshots sharing following 44 elements
       app//org.apache.lucene.codecs.blocktree.IntersectTermsEnum._next(IntersectTermsEnum.java:495)
       app//org.apache.lucene.codecs.blocktree.IntersectTermsEnum.next(IntersectTermsEnum.java:355)
       app//org.elasticsearch.search.internal.ExitableDirectoryReader$ExitableTermsEnum.next(ExitableDirectoryReader.java:173)
       app//org.apache.lucene.search.MultiTermQueryConstantScoreWrapper$1.collectTerms(MultiTermQueryConstantScoreWrapper.java:120)
       app//org.apache.lucene.search.MultiTermQueryConstantScoreWrapper$1.rewrite(MultiTermQueryConstantScoreWrapper.java:147)
       app//org.apache.lucene.search.MultiTermQueryConstantScoreWrapper$1.scorer(MultiTermQueryConstantScoreWrapper.java:219)
       app//org.apache.lucene.search.Weight.scorerSupplier(Weight.java:147)
       app//org.apache.lucene.search.LRUQueryCache$CachingWrapperWeight.scorerSupplier(LRUQueryCache.java:783)
       app//org.elasticsearch.indices.IndicesQueryCache$CachingWeightWrapper.scorerSupplier(IndicesQueryCache.java:157)
       app//org.apache.lucene.search.BooleanWeight.scorerSupplier(BooleanWeight.java:379)
       app//org.apache.lucene.search.LRUQueryCache$CachingWrapperWeight.scorerSupplier(LRUQueryCache.java:721)
       app//org.elasticsearch.indices.IndicesQueryCache$CachingWeightWrapper.scorerSupplier(IndicesQueryCache.java:157)
       app//org.apache.lucene.search.BooleanWeight.scorerSupplier(BooleanWeight.java:379)
       app//org.apache.lucene.search.BooleanWeight.scorer(BooleanWeight.java:344)
       app//org.apache.lucene.search.Weight.bulkScorer(Weight.java:181)
       app//org.apache.lucene.search.BooleanWeight.bulkScorer(BooleanWeight.java:338)
       app//org.apache.lucene.search.LRUQueryCache$CachingWrapperWeight.bulkScorer(LRUQueryCache.java:832)
       app//org.elasticsearch.indices.IndicesQueryCache$CachingWeightWrapper.bulkScorer(IndicesQueryCache.java:163)
       app//org.apache.lucene.search.BooleanWeight.optionalBulkScorer(BooleanWeight.java:201)
       app//org.apache.lucene.search.BooleanWeight.booleanScorer(BooleanWeight.java:258)
       app//org.apache.lucene.search.BooleanWeight.bulkScorer(BooleanWeight.java:332)
       app//org.apache.lucene.search.LRUQueryCache$CachingWrapperWeight.bulkScorer(LRUQueryCache.java:832)
       app//org.elasticsearch.indices.IndicesQueryCache$CachingWeightWrapper.bulkScorer(IndicesQueryCache.java:163)
       app//org.apache.lucene.search.ConstantScoreQuery$1.bulkScorer(ConstantScoreQuery.java:121)
       app//org.elasticsearch.search.internal.ContextIndexSearcher$1.bulkScorer(ContextIndexSearcher.java:258)
       app//org.elasticsearch.search.internal.ContextIndexSearcher.searchLeaf(ContextIndexSearcher.java:209)
       app//org.elasticsearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:185)
       app//org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:445)
       app//org.elasticsearch.search.query.QueryPhase.searchWithCollector(QueryPhase.java:343)
       app//org.elasticsearch.search.query.QueryPhase.executeInternal(QueryPhase.java:298)
       app//org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:150)
       app//org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:362)
       app//org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:435)
       app//org.elasticsearch.search.SearchService.access$200(SearchService.java:136)
       app//org.elasticsearch.search.SearchService$2.lambda$onResponse$0(SearchService.java:396)
       app//org.elasticsearch.search.SearchService$2$$Lambda$5355/0x0000000801ac6840.get(Unknown Source)
       app//org.elasticsearch.search.SearchService.lambda$runAsync$0(SearchService.java:412)
       app//org.elasticsearch.search.SearchService$$Lambda$5356/0x0000000801ac6c40.run(Unknown Source)
       app//org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:44)
       app//org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:710)
       app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.base@14.0.1/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
       java.base@14.0.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
       java.base@14.0.1/java.lang.Thread.run(Thread.java:832)
     3/10 snapshots sharing following 48 elements
       app//org.apache.lucene.util.compress.LZ4.decompress(LZ4.java:137)
       app//org.apache.lucene.codecs.blocktree.CompressionAlgorithm$3.read(CompressionAlgorithm.java:51)
       app//org.apache.lucene.codecs.blocktree.IntersectTermsEnumFrame.load(IntersectTermsEnumFrame.java:201)
       app//org.apache.lucene.codecs.blocktree.IntersectTermsEnum.pushFrame(IntersectTermsEnum.java:196)
       app//org.apache.lucene.codecs.blocktree.IntersectTermsEnum._next(IntersectTermsEnum.java:512)
       app//org.apache.lucene.codecs.blocktree.IntersectTermsEnum.next(IntersectTermsEnum.java:355)
       app//org.elasticsearch.search.internal.ExitableDirectoryReader$ExitableTermsEnum.next(ExitableDirectoryReader.java:173)
       app//org.apache.lucene.search.MultiTermQueryConstantScoreWrapper$1.collectTerms(MultiTermQueryConstantScoreWrapper.java:120)
       app//org.apache.lucene.search.MultiTermQueryConstantScoreWrapper$1.rewrite(MultiTermQueryConstantScoreWrapper.java:147)
       app//org.apache.lucene.search.MultiTermQueryConstantScoreWrapper$1.scorer(MultiTermQueryConstantScoreWrapper.java:219)
       app//org.apache.lucene.search.Weight.scorerSupplier(Weight.java:147)
       app//org.apache.lucene.search.LRUQueryCache$CachingWrapperWeight.scorerSupplier(LRUQueryCache.java:783)
       app//org.elasticsearch.indices.IndicesQueryCache$CachingWeightWrapper.scorerSupplier(IndicesQueryCache.java:157)
       app//org.apache.lucene.search.BooleanWeight.scorerSupplier(BooleanWeight.java:379)
       app//org.apache.lucene.search.LRUQueryCache$CachingWrapperWeight.scorerSupplier(LRUQueryCache.java:721)
       app//org.elasticsearch.indices.IndicesQueryCache$CachingWeightWrapper.scorerSupplier(IndicesQueryCache.java:157)
       app//org.apache.lucene.search.BooleanWeight.scorerSupplier(BooleanWeight.java:379)
       app//org.apache.lucene.search.BooleanWeight.scorer(BooleanWeight.java:344)
       app//org.apache.lucene.search.Weight.bulkScorer(Weight.java:181)
       app//org.apache.lucene.search.BooleanWeight.bulkScorer(BooleanWeight.java:338)
       app//org.apache.lucene.search.LRUQueryCache$CachingWrapperWeight.bulkScorer(LRUQueryCache.java:832)
       app//org.elasticsearch.indices.IndicesQueryCache$CachingWeightWrapper.bulkScorer(IndicesQueryCache.java:163)
       app//org.apache.lucene.search.BooleanWeight.optionalBulkScorer(BooleanWeight.java:201)
       app//org.apache.lucene.search.BooleanWeight.booleanScorer(BooleanWeight.java:258)
       app//org.apache.lucene.search.BooleanWeight.bulkScorer(BooleanWeight.java:332)
       app//org.apache.lucene.search.LRUQueryCache$CachingWrapperWeight.bulkScorer(LRUQueryCache.java:832)
       app//org.elasticsearch.indices.IndicesQueryCache$CachingWeightWrapper.bulkScorer(IndicesQueryCache.java:163)
       app//org.apache.lucene.search.ConstantScoreQuery$1.bulkScorer(ConstantScoreQuery.java:121)
       app//org.elasticsearch.search.internal.ContextIndexSearcher$1.bulkScorer(ContextIndexSearcher.java:258)
       app//org.elasticsearch.search.internal.ContextIndexSearcher.searchLeaf(ContextIndexSearcher.java:209)
       app//org.elasticsearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:185)
       app//org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:445)
       app//org.elasticsearch.search.query.QueryPhase.searchWithCollector(QueryPhase.java:343)
       app//org.elasticsearch.search.query.QueryPhase.executeInternal(QueryPhase.java:298)
       app//org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:150)
       app//org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:362)
       app//org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:435)
       app//org.elasticsearch.search.SearchService.access$200(SearchService.java:136)
       app//org.elasticsearch.search.SearchService$2.lambda$onResponse$0(SearchService.java:396)
       app//org.elasticsearch.search.SearchService$2$$Lambda$5355/0x0000000801ac6840.get(Unknown Source)
       app//org.elasticsearch.search.SearchService.lambda$runAsync$0(SearchService.java:412)
       app//org.elasticsearch.search.SearchService$$Lambda$5356/0x0000000801ac6c40.run(Unknown Source)
       app//org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:44)
       app//org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:710)
       app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.base@14.0.1/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
       java.base@14.0.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
       java.base@14.0.1/java.lang.Thread.run(Thread.java:832)
     unique snapshot
       app//org.apache.lucene.util.compress.LZ4.decompress(LZ4.java:103)
       app//org.apache.lucene.codecs.blocktree.CompressionAlgorithm$3.read(CompressionAlgorithm.java:51)
       app//org.apache.lucene.codecs.blocktree.IntersectTermsEnumFrame.load(IntersectTermsEnumFrame.java:201)
       app//org.apache.lucene.codecs.blocktree.IntersectTermsEnum.pushFrame(IntersectTermsEnum.java:196)
       app//org.apache.lucene.codecs.blocktree.IntersectTermsEnum._next(IntersectTermsEnum.java:512)
       app//org.apache.lucene.codecs.blocktree.IntersectTermsEnum.next(IntersectTermsEnum.java:355)
       app//org.elasticsearch.search.internal.ExitableDirectoryReader$ExitableTermsEnum.next(ExitableDirectoryReader.java:173)
       app//org.apache.lucene.search.MultiTermQueryConstantScoreWrapper$1.collectTerms(MultiTermQueryConstantScoreWrapper.java:120)
       app//org.apache.lucene.search.MultiTermQueryConstantScoreWrapper$1.rewrite(MultiTermQueryConstantScoreWrapper.java:147)
       app//org.apache.lucene.search.MultiTermQueryConstantScoreWrapper$1.scorer(MultiTermQueryConstantScoreWrapper.java:219)
       app//org.apache.lucene.search.Weight.scorerSupplier(Weight.java:147)
       app//org.apache.lucene.search.LRUQueryCache$CachingWrapperWeight.scorerSupplier(LRUQueryCache.java:783)
       app//org.elasticsearch.indices.IndicesQueryCache$CachingWeightWrapper.scorerSupplier(IndicesQueryCache.java:157)
       app//org.apache.lucene.search.BooleanWeight.scorerSupplier(BooleanWeight.java:379)
       app//org.apache.lucene.search.LRUQueryCache$CachingWrapperWeight.scorerSupplier(LRUQueryCache.java:721)
       app//org.elasticsearch.indices.IndicesQueryCache$CachingWeightWrapper.scorerSupplier(IndicesQueryCache.java:157)
       app//org.apache.lucene.search.BooleanWeight.scorerSupplier(BooleanWeight.java:379)
       app//org.apache.lucene.search.BooleanWeight.scorer(BooleanWeight.java:344)
       app//org.apache.lucene.search.Weight.bulkScorer(Weight.java:181)
       app//org.apache.lucene.search.BooleanWeight.bulkScorer(BooleanWeight.java:338)
       app//org.apache.lucene.search.LRUQueryCache$CachingWrapperWeight.bulkScorer(LRUQueryCache.java:832)
       app//org.elasticsearch.indices.IndicesQueryCache$CachingWeightWrapper.bulkScorer(IndicesQueryCache.java:163)
       app//org.apache.lucene.search.BooleanWeight.optionalBulkScorer(BooleanWeight.java:201)
       app//org.apache.lucene.search.BooleanWeight.booleanScorer(BooleanWeight.java:258)
       app//org.apache.lucene.search.BooleanWeight.bulkScorer(BooleanWeight.java:332)
       app//org.apache.lucene.search.LRUQueryCache$CachingWrapperWeight.bulkScorer(LRUQueryCache.java:832)
       app//org.elasticsearch.indices.IndicesQueryCache$CachingWeightWrapper.bulkScorer(IndicesQueryCache.java:163)
       app//org.apache.lucene.search.ConstantScoreQuery$1.bulkScorer(ConstantScoreQuery.java:121)
       app//org.elasticsearch.search.internal.ContextIndexSearcher$1.bulkScorer(ContextIndexSearcher.java:258)
       app//org.elasticsearch.search.internal.ContextIndexSearcher.searchLeaf(ContextIndexSearcher.java:209)
       app//org.elasticsearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:185)
       app//org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:445)
       app//org.elasticsearch.search.query.QueryPhase.searchWithCollector(QueryPhase.java:343)
       app//org.elasticsearch.search.query.QueryPhase.executeInternal(QueryPhase.java:298)
       app//org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:150)
       app//org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:362)
       app//org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:435)
       app//org.elasticsearch.search.SearchService.access$200(SearchService.java:136)
       app//org.elasticsearch.search.SearchService$2.lambda$onResponse$0(SearchService.java:396)
       app//org.elasticsearch.search.SearchService$2$$Lambda$5355/0x0000000801ac6840.get(Unknown Source)
       app//org.elasticsearch.search.SearchService.lambda$runAsync$0(SearchService.java:412)
       app//org.elasticsearch.search.SearchService$$Lambda$5356/0x0000000801ac6c40.run(Unknown Source)
       app//org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:44)
       app//org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:710)
       app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.base@14.0.1/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
       java.base@14.0.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
       java.base@14.0.1/java.lang.Thread.run(Thread.java:832)

I have found a certain query, that is doing a lot of wildcard queries against several fields like like *term* and I think such a query may be the cause of this, as it basically has to walk through the entire inverted index.

Happy if anyone agrees or disagrees :wink:

Are the fields mapped as keyword?

Not sure if the same issue, but from time to time the CPU on my hot nodes increases a lot when someone is making queries on keyword fields using a leading asterisk.