High CPU usage on only 1 Data node

Hi there,

I'm trying to understand why we sometimes get high CPU utilisation on only one Data node of our cluster. We mainly use 2 indexes (80GB for one, 500MB for the second) with 5 shards and one replica each, which seem to be evenly distributed across our 3 nodes.
We sometimes see one node's CPU stuck at 100% while the others sit at around 25%, and it always seems to be the same node...

Any idea?
I have a hot threads snapshot that I can share if it helps.
Our ES version is 6.8.8.

Yes, please share the hot threads output.
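
For anyone following along, here is a minimal sketch of how such output could be fetched over HTTP. It assumes the cluster is reachable at a base URL such as `http://localhost:9200` without authentication (an assumption; a hosted cluster will normally need credentials), and the helper names are made up for illustration. The `threads`, `interval` and `ignore_idle_threads` parameters belong to the nodes hot threads API and correspond to the `busiestThreads`, `interval` and `ignoreIdleThreads` values printed in the output header.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def hot_threads_url(base_url, threads=3, interval="500ms", ignore_idle=True):
    """Build the URL for GET /_nodes/hot_threads with the given sampling options."""
    params = urlencode({
        "threads": threads,                              # number of busiest threads to report
        "interval": interval,                            # sampling interval
        "ignore_idle_threads": str(ignore_idle).lower(), # "true" or "false"
    })
    return f"{base_url.rstrip('/')}/_nodes/hot_threads?{params}"


def fetch_hot_threads(base_url, **kwargs):
    """Fetch the plain-text hot threads report (hypothetical helper, no auth handling)."""
    with urlopen(hot_threads_url(base_url, **kwargs)) as resp:
        return resp.read().decode("utf-8")
```

Taking a few snapshots while the CPU is pegged, rather than just one, makes it easier to tell a recurring stack from a one-off.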

(Splitting into several messages because the text is too long.)

::: {instance-0000000002}{L6uqrHubT66TurycvCyEqg}{jZaNbG7yQlGwd5KrW2_UVw}{10.0.42.65}{10.0.42.65:19936}{logical_availability_zone=zone-2, server_name=instance-0000000002.7010a5b641be40388118884e5d60a284, availability_zone=ap-southeast-2c, xpack.installed=true, region=ap-southeast-2, instance_configuration=aws.data.highio.i3}
   Hot threads at 2020-09-04T13:18:53.275Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
   
   83.2% (416.2ms out of 500ms) cpu usage by thread 'elasticsearch[instance-0000000002][search][T#3]'
     7/10 snapshots sharing following 32 elements
       java.nio.Bits.copyToArray(Bits.java:836)
       java.nio.DirectByteBuffer.get(DirectByteBuffer.java:279)
       org.apache.lucene.store.ByteBufferGuard.getBytes(ByteBufferGuard.java:93)
       org.apache.lucene.store.ByteBufferIndexInput.readBytes(ByteBufferIndexInput.java:89)
       org.apache.lucene.codecs.blocktree.IntersectTermsEnumFrame.load(IntersectTermsEnumFrame.java:194)
       org.apache.lucene.codecs.blocktree.IntersectTermsEnum.pushFrame(IntersectTermsEnum.java:208)
       org.apache.lucene.codecs.blocktree.IntersectTermsEnum._next(IntersectTermsEnum.java:662)
       org.apache.lucene.codecs.blocktree.IntersectTermsEnum.next(IntersectTermsEnum.java:497)
       org.apache.lucene.search.FuzzyTermsEnum.next(FuzzyTermsEnum.java:211)
       org.apache.lucene.search.TermCollectingRewrite.collectTerms(TermCollectingRewrite.java:67)
       org.apache.lucene.search.TopTermsRewrite.rewrite(TopTermsRewrite.java:67)
       org.apache.lucene.search.MultiTermQuery.rewrite(MultiTermQuery.java:310)
       org.apache.lucene.search.DisjunctionMaxQuery.rewrite(DisjunctionMaxQuery.java:219)
       org.apache.lucene.search.BooleanQuery.rewrite(BooleanQuery.java:246)
       org.apache.lucene.search.IndexSearcher.rewrite(IndexSearcher.java:685)
       org.elasticsearch.search.internal.ContextIndexSearcher.rewrite(ContextIndexSearcher.java:106)
       org.elasticsearch.search.DefaultSearchContext.preProcess(DefaultSearchContext.java:263)
       org.elasticsearch.search.query.QueryPhase.preProcess(QueryPhase.java:91)
       org.elasticsearch.search.SearchService.createContext(SearchService.java:660)
       org.elasticsearch.search.SearchService.createAndPutContext(SearchService.java:599)
       org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:387)
       org.elasticsearch.search.SearchService.access$100(SearchService.java:126)
       org.elasticsearch.search.SearchService$2.onResponse(SearchService.java:359)
       org.elasticsearch.search.SearchService$2.onResponse(SearchService.java:355)
       org.elasticsearch.search.SearchService$4.doRun(SearchService.java:1117)
       org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:41)
       org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751)
       org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       java.lang.Thread.run(Thread.java:748)
     3/10 snapshots sharing following 25 elements
       org.apache.lucene.codecs.blocktree.IntersectTermsEnum.next(IntersectTermsEnum.java:497)
       org.apache.lucene.search.FuzzyTermsEnum.next(FuzzyTermsEnum.java:211)
       org.apache.lucene.search.TermCollectingRewrite.collectTerms(TermCollectingRewrite.java:67)
       org.apache.lucene.search.TopTermsRewrite.rewrite(TopTermsRewrite.java:67)
       org.apache.lucene.search.MultiTermQuery.rewrite(MultiTermQuery.java:310)
       org.apache.lucene.search.DisjunctionMaxQuery.rewrite(DisjunctionMaxQuery.java:219)
       org.apache.lucene.search.BooleanQuery.rewrite(BooleanQuery.java:246)
       org.apache.lucene.search.IndexSearcher.rewrite(IndexSearcher.java:685)
       org.elasticsearch.search.internal.ContextIndexSearcher.rewrite(ContextIndexSearcher.java:106)
       org.elasticsearch.search.DefaultSearchContext.preProcess(DefaultSearchContext.java:263)
       org.elasticsearch.search.query.QueryPhase.preProcess(QueryPhase.java:91)
       org.elasticsearch.search.SearchService.createContext(SearchService.java:660)
       org.elasticsearch.search.SearchService.createAndPutContext(SearchService.java:599)
       org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:387)
       org.elasticsearch.search.SearchService.access$100(SearchService.java:126)
       org.elasticsearch.search.SearchService$2.onResponse(SearchService.java:359)
       org.elasticsearch.search.SearchService$2.onResponse(SearchService.java:355)
       org.elasticsearch.search.SearchService$4.doRun(SearchService.java:1117)
       org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:41)
       org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751)
       org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       java.lang.Thread.run(Thread.java:748)
   
   64.5% (322.6ms out of 500ms) cpu usage by thread 'elasticsearch[instance-0000000002][search][T#4]'
     9/10 snapshots sharing following 25 elements
       org.apache.lucene.codecs.blocktree.IntersectTermsEnum.next(IntersectTermsEnum.java:497)
       org.apache.lucene.search.FuzzyTermsEnum.next(FuzzyTermsEnum.java:211)
       org.apache.lucene.search.TermCollectingRewrite.collectTerms(TermCollectingRewrite.java:67)
       org.apache.lucene.search.TopTermsRewrite.rewrite(TopTermsRewrite.java:67)
       org.apache.lucene.search.MultiTermQuery.rewrite(MultiTermQuery.java:310)
       org.apache.lucene.search.DisjunctionMaxQuery.rewrite(DisjunctionMaxQuery.java:219)
       org.apache.lucene.search.BooleanQuery.rewrite(BooleanQuery.java:246)
       org.apache.lucene.search.IndexSearcher.rewrite(IndexSearcher.java:685)
       org.elasticsearch.search.internal.ContextIndexSearcher.rewrite(ContextIndexSearcher.java:106)
       org.elasticsearch.search.DefaultSearchContext.preProcess(DefaultSearchContext.java:263)
       org.elasticsearch.search.query.QueryPhase.preProcess(QueryPhase.java:91)
       org.elasticsearch.search.SearchService.createContext(SearchService.java:660)
       org.elasticsearch.search.SearchService.createAndPutContext(SearchService.java:599)
       org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:387)
       org.elasticsearch.search.SearchService.access$100(SearchService.java:126)
       org.elasticsearch.search.SearchService$2.onResponse(SearchService.java:359)
       org.elasticsearch.search.SearchService$2.onResponse(SearchService.java:355)
       org.elasticsearch.search.SearchService$4.doRun(SearchService.java:1117)
       org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:41)
       org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751)
       org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       java.lang.Thread.run(Thread.java:748)
     unique snapshot
       java.nio.Bits.copyToArray(Bits.java:836)
       java.nio.DirectByteBuffer.get(DirectByteBuffer.java:279)
       org.apache.lucene.store.ByteBufferGuard.getBytes(ByteBufferGuard.java:93)
       org.apache.lucene.store.ByteBufferIndexInput.readBytes(ByteBufferIndexInput.java:89)
       org.apache.lucene.codecs.blocktree.IntersectTermsEnumFrame.load(IntersectTermsEnumFrame.java:194)
       org.apache.lucene.codecs.blocktree.IntersectTermsEnum.<init>(IntersectTermsEnum.java:127)
       org.apache.lucene.codecs.blocktree.FieldReader.intersect(FieldReader.java:188)
       org.apache.lucene.search.FuzzyTermsEnum.getAutomatonEnum(FuzzyTermsEnum.java:169)
       org.apache.lucene.search.FuzzyTermsEnum.bottomChanged(FuzzyTermsEnum.java:196)
       org.apache.lucene.search.FuzzyTermsEnum.<init>(FuzzyTermsEnum.java:151)
       org.apache.lucene.search.FuzzyQuery.getTermsEnum(FuzzyQuery.java:154)
       org.apache.lucene.search.MultiTermQuery$RewriteMethod.getTermsEnum(MultiTermQuery.java:78)
       org.apache.lucene.search.TermCollectingRewrite.collectTerms(TermCollectingRewrite.java:58)
       org.apache.lucene.search.TopTermsRewrite.rewrite(TopTermsRewrite.java:67)
       org.apache.lucene.search.MultiTermQuery.rewrite(MultiTermQuery.java:310)
       org.apache.lucene.search.DisjunctionMaxQuery.rewrite(DisjunctionMaxQuery.java:219)
       org.apache.lucene.search.BooleanQuery.rewrite(BooleanQuery.java:246)
       org.apache.lucene.search.IndexSearcher.rewrite(IndexSearcher.java:685)
       org.elasticsearch.search.internal.ContextIndexSearcher.rewrite(ContextIndexSearcher.java:106)
       org.elasticsearch.search.DefaultSearchContext.preProcess(DefaultSearchContext.java:263)
       org.elasticsearch.search.query.QueryPhase.preProcess(QueryPhase.java:91)
       org.elasticsearch.search.SearchService.createContext(SearchService.java:660)
       org.elasticsearch.search.SearchService.createAndPutContext(SearchService.java:599)
       org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:387)
       org.elasticsearch.search.SearchService.access$100(SearchService.java:126)
       org.elasticsearch.search.SearchService$2.onResponse(SearchService.java:359)
       org.elasticsearch.search.SearchService$2.onResponse(SearchService.java:355)
       org.elasticsearch.search.SearchService$4.doRun(SearchService.java:1117)
       org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:41)
       org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751)
       org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       java.lang.Thread.run(Thread.java:748)

instance-0000000002 continued:

   57.0% (285.1ms out of 500ms) cpu usage by thread 'elasticsearch[instance-0000000002][search][T#2]'
     9/10 snapshots sharing following 25 elements
       org.apache.lucene.codecs.blocktree.IntersectTermsEnum.next(IntersectTermsEnum.java:497)
       org.apache.lucene.search.FuzzyTermsEnum.next(FuzzyTermsEnum.java:211)
       org.apache.lucene.search.TermCollectingRewrite.collectTerms(TermCollectingRewrite.java:67)
       org.apache.lucene.search.TopTermsRewrite.rewrite(TopTermsRewrite.java:67)
       org.apache.lucene.search.MultiTermQuery.rewrite(MultiTermQuery.java:310)
       org.apache.lucene.search.DisjunctionMaxQuery.rewrite(DisjunctionMaxQuery.java:219)
       org.apache.lucene.search.BooleanQuery.rewrite(BooleanQuery.java:246)
       org.apache.lucene.search.IndexSearcher.rewrite(IndexSearcher.java:685)
       org.elasticsearch.search.internal.ContextIndexSearcher.rewrite(ContextIndexSearcher.java:106)
       org.elasticsearch.search.DefaultSearchContext.preProcess(DefaultSearchContext.java:263)
       org.elasticsearch.search.query.QueryPhase.preProcess(QueryPhase.java:91)
       org.elasticsearch.search.SearchService.createContext(SearchService.java:660)
       org.elasticsearch.search.SearchService.createAndPutContext(SearchService.java:599)
       org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:387)
       org.elasticsearch.search.SearchService.access$100(SearchService.java:126)
       org.elasticsearch.search.SearchService$2.onResponse(SearchService.java:359)
       org.elasticsearch.search.SearchService$2.onResponse(SearchService.java:355)
       org.elasticsearch.search.SearchService$4.doRun(SearchService.java:1117)
       org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:41)
       org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751)
       org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       java.lang.Thread.run(Thread.java:748)
     unique snapshot
       java.nio.Bits.copyToArray(Bits.java:836)
       java.nio.DirectByteBuffer.get(DirectByteBuffer.java:279)
       org.apache.lucene.store.ByteBufferGuard.getBytes(ByteBufferGuard.java:93)
       org.apache.lucene.store.ByteBufferIndexInput.readBytes(ByteBufferIndexInput.java:89)
       org.apache.lucene.codecs.blocktree.IntersectTermsEnumFrame.load(IntersectTermsEnumFrame.java:194)
       org.apache.lucene.codecs.blocktree.IntersectTermsEnum.<init>(IntersectTermsEnum.java:127)
       org.apache.lucene.codecs.blocktree.FieldReader.intersect(FieldReader.java:188)
       org.apache.lucene.search.FuzzyTermsEnum.getAutomatonEnum(FuzzyTermsEnum.java:169)
       org.apache.lucene.search.FuzzyTermsEnum.bottomChanged(FuzzyTermsEnum.java:196)
       org.apache.lucene.search.FuzzyTermsEnum.<init>(FuzzyTermsEnum.java:151)
       org.apache.lucene.search.FuzzyQuery.getTermsEnum(FuzzyQuery.java:154)
       org.apache.lucene.search.MultiTermQuery$RewriteMethod.getTermsEnum(MultiTermQuery.java:78)
       org.apache.lucene.search.TermCollectingRewrite.collectTerms(TermCollectingRewrite.java:58)
       org.apache.lucene.search.TopTermsRewrite.rewrite(TopTermsRewrite.java:67)
       org.apache.lucene.search.MultiTermQuery.rewrite(MultiTermQuery.java:310)
       org.apache.lucene.search.DisjunctionMaxQuery.rewrite(DisjunctionMaxQuery.java:219)
       org.apache.lucene.search.BooleanQuery.rewrite(BooleanQuery.java:246)
       org.apache.lucene.search.IndexSearcher.rewrite(IndexSearcher.java:685)
       org.elasticsearch.search.internal.ContextIndexSearcher.rewrite(ContextIndexSearcher.java:106)
       org.elasticsearch.search.DefaultSearchContext.preProcess(DefaultSearchContext.java:263)
       org.elasticsearch.search.query.QueryPhase.preProcess(QueryPhase.java:91)
       org.elasticsearch.search.SearchService.createContext(SearchService.java:660)
       org.elasticsearch.search.SearchService.createAndPutContext(SearchService.java:599)
       org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:387)
       org.elasticsearch.search.SearchService.access$100(SearchService.java:126)
       org.elasticsearch.search.SearchService$2.onResponse(SearchService.java:359)
       org.elasticsearch.search.SearchService$2.onResponse(SearchService.java:355)
       org.elasticsearch.search.SearchService$4.doRun(SearchService.java:1117)
       org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:41)
       org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751)
       org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       java.lang.Thread.run(Thread.java:748)

And the 2 other instances:

::: {instance-0000000001}{VACHH-0hTn2DkDifEWdw2A}{tPKYBXW0QduX6FQ-jVO1fw}{10.0.30.247}{10.0.30.247:19594}{logical_availability_zone=zone-1, server_name=instance-0000000001.7010a5b641be40388118884e5d60a284, availability_zone=ap-southeast-2b, xpack.installed=true, instance_configuration=aws.data.highio.i3, region=ap-southeast-2}
   Hot threads at 2020-09-04T13:18:53.274Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

::: {instance-0000000000}{28gdCCBDTV2HMa1RcbU2_Q}{7lBc1d7pQiqYAnxjRI1Mmg}{10.0.10.217}{10.0.10.217:19331}{logical_availability_zone=zone-0, server_name=instance-0000000000.7010a5b641be40388118884e5d60a284, availability_zone=ap-southeast-2a, xpack.installed=true, region=ap-southeast-2, instance_configuration=aws.data.highio.i3}
   Hot threads at 2020-09-04T13:18:53.275Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

It looks like that node is busy serving an expensive fuzzy search request. Do you have any expensive searches running against indices with shards primarily on this node? Are all your shards and replicas assigned and evenly distributed across the cluster? Do you use preference when you query the cluster? Do all nodes have the same hardware specification?
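
To make that reading of the dump easier to repeat on other snapshots, here is a rough sketch that summarizes a hot threads report: it lists each reported thread with its CPU share and notes whether a given class name (by default `FuzzyTermsEnum`, the frame that dominates the stacks above) appears in that thread's stack. The function name and the regular expression are my own and only assume the header format shown in the output above.

```python
import re

# Matches per-thread header lines such as:
#   83.2% (416.2ms out of 500ms) cpu usage by thread 'elasticsearch[node][search][T#3]'
THREAD_HEADER = re.compile(r"^\s*([\d.]+)% \(.*?\) cpu usage by thread '(.+)'\s*$")


def summarize_hot_threads(text, marker="FuzzyTermsEnum"):
    """Return a list of (thread_name, cpu_percent, marker_seen) tuples."""
    results = []
    current = None
    for line in text.splitlines():
        match = THREAD_HEADER.match(line)
        if match:
            # Start a new thread entry: [name, cpu %, marker seen so far]
            current = [match.group(2), float(match.group(1)), False]
            results.append(current)
        elif current is not None and marker in line:
            # A stack frame of the current thread mentions the marker class
            current[2] = True
    return [tuple(entry) for entry in results]
```

Run against the dump above, all three busy search threads should come back flagged, which is what points at fuzzy query rewriting rather than indexing or merging.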

Yes, all nodes have the same specifications, shards and replicas are evenly distributed, and we don't use preference when querying the cluster as far as I can tell.
During the test that produced this snapshot (run on a cluster we created for the occasion, copied from our Prod), we were only running 2 types of search and 1 type of indexing with different inputs, so I can't think of any specific search that would be more complex than another.
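
In case it is useful for double-checking the distribution, this is a small sketch that counts shard copies per node and index from the text returned by `GET _cat/shards?h=index,shard,prirep,state,node`. It assumes that column order and no header row (so no `v` parameter), and the function name is just illustrative.

```python
from collections import Counter


def shards_per_node(cat_shards_text):
    """Count shard copies per (node, index) from _cat/shards text output.

    Expects whitespace-separated columns: index shard prirep state node.
    Unassigned shards have no node column and are counted under "UNASSIGNED".
    """
    counts = Counter()
    for line in cat_shards_text.splitlines():
        parts = line.split()
        if len(parts) < 4:
            continue  # skip blank or malformed lines
        index = parts[0]
        node = parts[4] if len(parts) > 4 else "UNASSIGNED"
        counts[(node, index)] += 1
    return counts
```

An even total per node can still hide an imbalance for a single index, so it is worth looking at the counts for the 80GB index on its own.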

Just a thought: what if a lot of document deletions are going on in parallel? Could that be a lead?