High CPU usage on only 1 Data node

Hi there,

I'm trying to understand why we sometimes see high CPU utilisation on only one data node of our cluster. We mainly use 2 indexes (80GB for one, 500MB for the other), each with 5 shards and one replica, which seem to be evenly distributed across our 3 nodes.
Sometimes one node's CPU gets stuck at 100% while the others sit at around 25%, and it always seems to be the same node.

Any ideas?
I have a hot threads snapshot that I can share if it helps.
Our ES version is 6.8.8.

Yes, please share the hot threads output.

(Splitting into several messages because the text is too long.)

::: {instance-0000000002}{L6uqrHubT66TurycvCyEqg}{jZaNbG7yQlGwd5KrW2_UVw}{10.0.42.65}{10.0.42.65:19936}{logical_availability_zone=zone-2, server_name=instance-0000000002.7010a5b641be40388118884e5d60a284, availability_zone=ap-southeast-2c, xpack.installed=true, region=ap-southeast-2, instance_configuration=aws.data.highio.i3}
   Hot threads at 2020-09-04T13:18:53.275Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
   
   83.2% (416.2ms out of 500ms) cpu usage by thread 'elasticsearch[instance-0000000002][search][T#3]'
     7/10 snapshots sharing following 32 elements
       java.nio.Bits.copyToArray(Bits.java:836)
       java.nio.DirectByteBuffer.get(DirectByteBuffer.java:279)
       org.apache.lucene.store.ByteBufferGuard.getBytes(ByteBufferGuard.java:93)
       org.apache.lucene.store.ByteBufferIndexInput.readBytes(ByteBufferIndexInput.java:89)
       org.apache.lucene.codecs.blocktree.IntersectTermsEnumFrame.load(IntersectTermsEnumFrame.java:194)
       org.apache.lucene.codecs.blocktree.IntersectTermsEnum.pushFrame(IntersectTermsEnum.java:208)
       org.apache.lucene.codecs.blocktree.IntersectTermsEnum._next(IntersectTermsEnum.java:662)
       org.apache.lucene.codecs.blocktree.IntersectTermsEnum.next(IntersectTermsEnum.java:497)
       org.apache.lucene.search.FuzzyTermsEnum.next(FuzzyTermsEnum.java:211)
       org.apache.lucene.search.TermCollectingRewrite.collectTerms(TermCollectingRewrite.java:67)
       org.apache.lucene.search.TopTermsRewrite.rewrite(TopTermsRewrite.java:67)
       org.apache.lucene.search.MultiTermQuery.rewrite(MultiTermQuery.java:310)
       org.apache.lucene.search.DisjunctionMaxQuery.rewrite(DisjunctionMaxQuery.java:219)
       org.apache.lucene.search.BooleanQuery.rewrite(BooleanQuery.java:246)
       org.apache.lucene.search.IndexSearcher.rewrite(IndexSearcher.java:685)
       org.elasticsearch.search.internal.ContextIndexSearcher.rewrite(ContextIndexSearcher.java:106)
       org.elasticsearch.search.DefaultSearchContext.preProcess(DefaultSearchContext.java:263)
       org.elasticsearch.search.query.QueryPhase.preProcess(QueryPhase.java:91)
       org.elasticsearch.search.SearchService.createContext(SearchService.java:660)
       org.elasticsearch.search.SearchService.createAndPutContext(SearchService.java:599)
       org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:387)
       org.elasticsearch.search.SearchService.access$100(SearchService.java:126)
       org.elasticsearch.search.SearchService$2.onResponse(SearchService.java:359)
       org.elasticsearch.search.SearchService$2.onResponse(SearchService.java:355)
       org.elasticsearch.search.SearchService$4.doRun(SearchService.java:1117)
       org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:41)
       org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751)
       org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       java.lang.Thread.run(Thread.java:748)
     3/10 snapshots sharing following 25 elements
       org.apache.lucene.codecs.blocktree.IntersectTermsEnum.next(IntersectTermsEnum.java:497)
       org.apache.lucene.search.FuzzyTermsEnum.next(FuzzyTermsEnum.java:211)
       org.apache.lucene.search.TermCollectingRewrite.collectTerms(TermCollectingRewrite.java:67)
       org.apache.lucene.search.TopTermsRewrite.rewrite(TopTermsRewrite.java:67)
       org.apache.lucene.search.MultiTermQuery.rewrite(MultiTermQuery.java:310)
       org.apache.lucene.search.DisjunctionMaxQuery.rewrite(DisjunctionMaxQuery.java:219)
       org.apache.lucene.search.BooleanQuery.rewrite(BooleanQuery.java:246)
       org.apache.lucene.search.IndexSearcher.rewrite(IndexSearcher.java:685)
       org.elasticsearch.search.internal.ContextIndexSearcher.rewrite(ContextIndexSearcher.java:106)
       org.elasticsearch.search.DefaultSearchContext.preProcess(DefaultSearchContext.java:263)
       org.elasticsearch.search.query.QueryPhase.preProcess(QueryPhase.java:91)
       org.elasticsearch.search.SearchService.createContext(SearchService.java:660)
       org.elasticsearch.search.SearchService.createAndPutContext(SearchService.java:599)
       org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:387)
       org.elasticsearch.search.SearchService.access$100(SearchService.java:126)
       org.elasticsearch.search.SearchService$2.onResponse(SearchService.java:359)
       org.elasticsearch.search.SearchService$2.onResponse(SearchService.java:355)
       org.elasticsearch.search.SearchService$4.doRun(SearchService.java:1117)
       org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:41)
       org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751)
       org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       java.lang.Thread.run(Thread.java:748)
   
   64.5% (322.6ms out of 500ms) cpu usage by thread 'elasticsearch[instance-0000000002][search][T#4]'
     9/10 snapshots sharing following 25 elements
       org.apache.lucene.codecs.blocktree.IntersectTermsEnum.next(IntersectTermsEnum.java:497)
       org.apache.lucene.search.FuzzyTermsEnum.next(FuzzyTermsEnum.java:211)
       org.apache.lucene.search.TermCollectingRewrite.collectTerms(TermCollectingRewrite.java:67)
       org.apache.lucene.search.TopTermsRewrite.rewrite(TopTermsRewrite.java:67)
       org.apache.lucene.search.MultiTermQuery.rewrite(MultiTermQuery.java:310)
       org.apache.lucene.search.DisjunctionMaxQuery.rewrite(DisjunctionMaxQuery.java:219)
       org.apache.lucene.search.BooleanQuery.rewrite(BooleanQuery.java:246)
       org.apache.lucene.search.IndexSearcher.rewrite(IndexSearcher.java:685)
       org.elasticsearch.search.internal.ContextIndexSearcher.rewrite(ContextIndexSearcher.java:106)
       org.elasticsearch.search.DefaultSearchContext.preProcess(DefaultSearchContext.java:263)
       org.elasticsearch.search.query.QueryPhase.preProcess(QueryPhase.java:91)
       org.elasticsearch.search.SearchService.createContext(SearchService.java:660)
       org.elasticsearch.search.SearchService.createAndPutContext(SearchService.java:599)
       org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:387)
       org.elasticsearch.search.SearchService.access$100(SearchService.java:126)
       org.elasticsearch.search.SearchService$2.onResponse(SearchService.java:359)
       org.elasticsearch.search.SearchService$2.onResponse(SearchService.java:355)
       org.elasticsearch.search.SearchService$4.doRun(SearchService.java:1117)
       org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:41)
       org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751)
       org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       java.lang.Thread.run(Thread.java:748)
     unique snapshot
       java.nio.Bits.copyToArray(Bits.java:836)
       java.nio.DirectByteBuffer.get(DirectByteBuffer.java:279)
       org.apache.lucene.store.ByteBufferGuard.getBytes(ByteBufferGuard.java:93)
       org.apache.lucene.store.ByteBufferIndexInput.readBytes(ByteBufferIndexInput.java:89)
       org.apache.lucene.codecs.blocktree.IntersectTermsEnumFrame.load(IntersectTermsEnumFrame.java:194)
       org.apache.lucene.codecs.blocktree.IntersectTermsEnum.<init>(IntersectTermsEnum.java:127)
       org.apache.lucene.codecs.blocktree.FieldReader.intersect(FieldReader.java:188)
       org.apache.lucene.search.FuzzyTermsEnum.getAutomatonEnum(FuzzyTermsEnum.java:169)
       org.apache.lucene.search.FuzzyTermsEnum.bottomChanged(FuzzyTermsEnum.java:196)
       org.apache.lucene.search.FuzzyTermsEnum.<init>(FuzzyTermsEnum.java:151)
       org.apache.lucene.search.FuzzyQuery.getTermsEnum(FuzzyQuery.java:154)
       org.apache.lucene.search.MultiTermQuery$RewriteMethod.getTermsEnum(MultiTermQuery.java:78)
       org.apache.lucene.search.TermCollectingRewrite.collectTerms(TermCollectingRewrite.java:58)
       org.apache.lucene.search.TopTermsRewrite.rewrite(TopTermsRewrite.java:67)
       org.apache.lucene.search.MultiTermQuery.rewrite(MultiTermQuery.java:310)
       org.apache.lucene.search.DisjunctionMaxQuery.rewrite(DisjunctionMaxQuery.java:219)
       org.apache.lucene.search.BooleanQuery.rewrite(BooleanQuery.java:246)
       org.apache.lucene.search.IndexSearcher.rewrite(IndexSearcher.java:685)
       org.elasticsearch.search.internal.ContextIndexSearcher.rewrite(ContextIndexSearcher.java:106)
       org.elasticsearch.search.DefaultSearchContext.preProcess(DefaultSearchContext.java:263)
       org.elasticsearch.search.query.QueryPhase.preProcess(QueryPhase.java:91)
       org.elasticsearch.search.SearchService.createContext(SearchService.java:660)
       org.elasticsearch.search.SearchService.createAndPutContext(SearchService.java:599)
       org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:387)
       org.elasticsearch.search.SearchService.access$100(SearchService.java:126)
       org.elasticsearch.search.SearchService$2.onResponse(SearchService.java:359)
       org.elasticsearch.search.SearchService$2.onResponse(SearchService.java:355)
       org.elasticsearch.search.SearchService$4.doRun(SearchService.java:1117)
       org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:41)
       org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751)
       org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       java.lang.Thread.run(Thread.java:748)

instance-0000000002 continued:

   57.0% (285.1ms out of 500ms) cpu usage by thread 'elasticsearch[instance-0000000002][search][T#2]'
     9/10 snapshots sharing following 25 elements
       org.apache.lucene.codecs.blocktree.IntersectTermsEnum.next(IntersectTermsEnum.java:497)
       org.apache.lucene.search.FuzzyTermsEnum.next(FuzzyTermsEnum.java:211)
       org.apache.lucene.search.TermCollectingRewrite.collectTerms(TermCollectingRewrite.java:67)
       org.apache.lucene.search.TopTermsRewrite.rewrite(TopTermsRewrite.java:67)
       org.apache.lucene.search.MultiTermQuery.rewrite(MultiTermQuery.java:310)
       org.apache.lucene.search.DisjunctionMaxQuery.rewrite(DisjunctionMaxQuery.java:219)
       org.apache.lucene.search.BooleanQuery.rewrite(BooleanQuery.java:246)
       org.apache.lucene.search.IndexSearcher.rewrite(IndexSearcher.java:685)
       org.elasticsearch.search.internal.ContextIndexSearcher.rewrite(ContextIndexSearcher.java:106)
       org.elasticsearch.search.DefaultSearchContext.preProcess(DefaultSearchContext.java:263)
       org.elasticsearch.search.query.QueryPhase.preProcess(QueryPhase.java:91)
       org.elasticsearch.search.SearchService.createContext(SearchService.java:660)
       org.elasticsearch.search.SearchService.createAndPutContext(SearchService.java:599)
       org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:387)
       org.elasticsearch.search.SearchService.access$100(SearchService.java:126)
       org.elasticsearch.search.SearchService$2.onResponse(SearchService.java:359)
       org.elasticsearch.search.SearchService$2.onResponse(SearchService.java:355)
       org.elasticsearch.search.SearchService$4.doRun(SearchService.java:1117)
       org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:41)
       org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751)
       org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       java.lang.Thread.run(Thread.java:748)
     unique snapshot
       java.nio.Bits.copyToArray(Bits.java:836)
       java.nio.DirectByteBuffer.get(DirectByteBuffer.java:279)
       org.apache.lucene.store.ByteBufferGuard.getBytes(ByteBufferGuard.java:93)
       org.apache.lucene.store.ByteBufferIndexInput.readBytes(ByteBufferIndexInput.java:89)
       org.apache.lucene.codecs.blocktree.IntersectTermsEnumFrame.load(IntersectTermsEnumFrame.java:194)
       org.apache.lucene.codecs.blocktree.IntersectTermsEnum.<init>(IntersectTermsEnum.java:127)
       org.apache.lucene.codecs.blocktree.FieldReader.intersect(FieldReader.java:188)
       org.apache.lucene.search.FuzzyTermsEnum.getAutomatonEnum(FuzzyTermsEnum.java:169)
       org.apache.lucene.search.FuzzyTermsEnum.bottomChanged(FuzzyTermsEnum.java:196)
       org.apache.lucene.search.FuzzyTermsEnum.<init>(FuzzyTermsEnum.java:151)
       org.apache.lucene.search.FuzzyQuery.getTermsEnum(FuzzyQuery.java:154)
       org.apache.lucene.search.MultiTermQuery$RewriteMethod.getTermsEnum(MultiTermQuery.java:78)
       org.apache.lucene.search.TermCollectingRewrite.collectTerms(TermCollectingRewrite.java:58)
       org.apache.lucene.search.TopTermsRewrite.rewrite(TopTermsRewrite.java:67)
       org.apache.lucene.search.MultiTermQuery.rewrite(MultiTermQuery.java:310)
       org.apache.lucene.search.DisjunctionMaxQuery.rewrite(DisjunctionMaxQuery.java:219)
       org.apache.lucene.search.BooleanQuery.rewrite(BooleanQuery.java:246)
       org.apache.lucene.search.IndexSearcher.rewrite(IndexSearcher.java:685)
       org.elasticsearch.search.internal.ContextIndexSearcher.rewrite(ContextIndexSearcher.java:106)
       org.elasticsearch.search.DefaultSearchContext.preProcess(DefaultSearchContext.java:263)
       org.elasticsearch.search.query.QueryPhase.preProcess(QueryPhase.java:91)
       org.elasticsearch.search.SearchService.createContext(SearchService.java:660)
       org.elasticsearch.search.SearchService.createAndPutContext(SearchService.java:599)
       org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:387)
       org.elasticsearch.search.SearchService.access$100(SearchService.java:126)
       org.elasticsearch.search.SearchService$2.onResponse(SearchService.java:359)
       org.elasticsearch.search.SearchService$2.onResponse(SearchService.java:355)
       org.elasticsearch.search.SearchService$4.doRun(SearchService.java:1117)
       org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:41)
       org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751)
       org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       java.lang.Thread.run(Thread.java:748)

And the 2 other instances:

::: {instance-0000000001}{VACHH-0hTn2DkDifEWdw2A}{tPKYBXW0QduX6FQ-jVO1fw}{10.0.30.247}{10.0.30.247:19594}{logical_availability_zone=zone-1, server_name=instance-0000000001.7010a5b641be40388118884e5d60a284, availability_zone=ap-southeast-2b, xpack.installed=true, instance_configuration=aws.data.highio.i3, region=ap-southeast-2}
   Hot threads at 2020-09-04T13:18:53.274Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

::: {instance-0000000000}{28gdCCBDTV2HMa1RcbU2_Q}{7lBc1d7pQiqYAnxjRI1Mmg}{10.0.10.217}{10.0.10.217:19331}{logical_availability_zone=zone-0, server_name=instance-0000000000.7010a5b641be40388118884e5d60a284, availability_zone=ap-southeast-2a, xpack.installed=true, region=ap-southeast-2, instance_configuration=aws.data.highio.i3}
   Hot threads at 2020-09-04T13:18:53.275Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
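
For anyone comparing nodes from a dump like the one above, here is a minimal Python sketch that condenses hot threads text into one entry per busy thread (CPU percentage, thread name, topmost stack frame). The function name `summarise_hot_threads` and the regular expressions are my own assumptions based on the layout of the output pasted in this thread; they are not part of Elasticsearch and may need adjusting for other versions.

```python
import re

# Patterns inferred from the hot threads text in this thread (an assumption,
# not an official grammar for the output).
NODE_RE = re.compile(r"^::: \{([^}]+)\}")
THREAD_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)% \(.*?\) (\w+) usage by thread '([^']+)'")
FRAME_RE = re.compile(r"^\s+([\w.$<>]+)\(")


def summarise_hot_threads(text):
    """Return {node_name: [{"cpu_pct", "thread", "top_frame"}, ...]}."""
    summary = {}
    node = None
    current = None
    for line in text.splitlines():
        match = NODE_RE.match(line)
        if match:
            # A new "::: {node}..." header starts a new node section.
            node = match.group(1)
            summary[node] = []
            current = None
            continue
        match = THREAD_RE.match(line)
        if match and node is not None:
            current = {
                "cpu_pct": float(match.group(1)),
                "thread": match.group(3),
                "top_frame": None,
            }
            summary[node].append(current)
            continue
        match = FRAME_RE.match(line)
        if match and current is not None and current["top_frame"] is None:
            # Keep only the first (topmost) frame seen for each thread.
            current["top_frame"] = match.group(1)
    return summary
```

A node whose section contains only the "Hot threads at ..." line, like the two instances above, ends up with an empty list, which makes the imbalance easy to spot.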

It looks like that node is busy serving an expensive fuzzy search request. Do you have any expensive searches running against indices whose shards sit primarily on this node? Are all your shards and replicas assigned and evenly distributed across the cluster? Do you use a search preference when you query the cluster? Do all nodes have the same hardware specification?
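
One way to check the shard distribution question is to count shard copies per node and per index. The sketch below assumes the rows come from `GET _cat/shards?format=json`, where each row has `index`, `prirep` and `node` fields and `node` is null for unassigned copies; the helper name `shard_copies_per_node` and the example index names are hypothetical.

```python
from collections import Counter, defaultdict


def shard_copies_per_node(rows):
    """Count shard copies (primaries and replicas) per node and per index.

    `rows` is assumed to be the parsed JSON list returned by
    GET _cat/shards?format=json.
    """
    counts = defaultdict(Counter)
    for row in rows:
        # Unassigned copies have no node; group them under a sentinel key.
        node = row.get("node") or "UNASSIGNED"
        counts[node][row["index"]] += 1
    return {node: dict(per_index) for node, per_index in counts.items()}


if __name__ == "__main__":
    # Hypothetical rows, for illustration only.
    example_rows = [
        {"index": "big-index", "shard": "0", "prirep": "p", "node": "instance-0000000002"},
        {"index": "big-index", "shard": "0", "prirep": "r", "node": "instance-0000000001"},
        {"index": "small-index", "shard": "0", "prirep": "p", "node": "instance-0000000000"},
    ]
    for node, per_index in sorted(shard_copies_per_node(example_rows).items()):
        print(node, per_index)
```

Note that with 5 shards and one replica there are 10 copies of each index, which cannot be split evenly across 3 nodes: one node has to hold at least 4 copies of the large index while the others hold 3, so that node may do somewhat more search work even when the cluster looks balanced.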

Yes, all nodes have the same specifications, shards and replicas are evenly distributed, and as far as I can tell we don't use preference when we query the cluster.
During the test that produced this snapshot (run on a cluster we created for the occasion, copied from our Prod), we were only running 2 types of search and 1 type of indexing with different inputs, so I can't think of any specific search that would be more complex than another.

Just a thought: if we have a lot of document deletions going on in parallel, could that be a lead?
