Constant High CPU Usage by Elasticsearch

I have the below setup:
Elasticsearch version: 7.5.1
APM Server version: 7.5.1
APM Agent language and version: elasticapm-java/1.12.0
It's a single-node cluster with about 500 shards and 1 replica.
64 GB RAM, 32 processors, 3 TB storage (55% occupied)

Workload: inserting documents from APM Server and Logstash, plus a few dashboards for visualisation.

I am observing that the CPU is constantly between 90% and 100%, and as a result a lot of events are getting lost with the 'queue is full' error. Even basic index-listing APIs take a long time to respond.
Until two weeks ago things were working fine, with about 80% of the current volume.
The GET _nodes/hot_threads API gives the results below. I have taken multiple snapshots, and I see the management and write threads showing up alternately.
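
In case anyone wants to reproduce the capture, something along these lines would take a hot_threads snapshot plus the write/management thread pool rejection counters once a minute — a minimal sketch using only the Python standard library; the endpoint, interval and file names are placeholders, not my exact setup:

    import time
    import urllib.request

    # placeholder endpoint -- the node's HTTP address; no authentication assumed here
    ES = "http://10.10.10.211:9200"

    def fetch(path):
        with urllib.request.urlopen(ES + path, timeout=30) as resp:
            return resp.read().decode("utf-8")

    # every minute: one hot_threads snapshot + write/management thread pool counters
    for _ in range(10):
        ts = time.strftime("%Y%m%dT%H%M%S")
        with open("hot_threads_%s.txt" % ts, "w") as f:
            f.write(fetch("/_nodes/hot_threads?threads=3"))
        with open("thread_pool_%s.txt" % ts, "w") as f:
            f.write(fetch("/_cat/thread_pool/write,management?v&h=name,active,queue,rejected,completed"))
        time.sleep(60)

Two of the snapshots I took are pasted below.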

::: {elk.abc.co.in}{D9AO4ua7QQKowLbUe4-Ffw}{WlOQbHwMQGaDTQ1gELXQDw}{10.10.10.211}{10.10.10.211:9300}{dilm}{ml.machine_memory=67335020544, xpack.installed=true, ml.max_open_jobs=20}
   Hot threads at 2020-07-21T05:43:32.529Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
   
   181.7% (908.7ms out of 500ms) cpu usage by thread 'elasticsearch[elk.abc.co.in][write][T#6]'
     4/10 snapshots sharing following 54 elements
       app//org.elasticsearch.ingest.CompoundProcessor.innerExecute(CompoundProcessor.java:137)
       app//org.elasticsearch.ingest.CompoundProcessor.lambda$innerExecute$1(CompoundProcessor.java:156)
       app//org.elasticsearch.ingest.CompoundProcessor$$Lambda$5961/0x00007eebb68ea440.accept(Unknown Source)
       app//org.elasticsearch.ingest.IngestDocument.lambda$executePipeline$0(IngestDocument.java:653)
       app//org.elasticsearch.ingest.IngestDocument$$Lambda$5962/0x00007eebb68e7108.accept(Unknown Source)
       app//org.elasticsearch.ingest.Pipeline.lambda$execute$0(Pipeline.java:106)
       app//org.elasticsearch.ingest.Pipeline$$Lambda$5960/0x00007eebb68ecd08.accept(Unknown Source)
       app//org.elasticsearch.ingest.CompoundProcessor.innerExecute(CompoundProcessor.java:128)
     6/10 snapshots sharing following 8 elements
       app//org.elasticsearch.ingest.IngestService.executePipelines(IngestService.java:409)
       app//org.elasticsearch.ingest.IngestService.access$000(IngestService.java:75)
       app//org.elasticsearch.ingest.IngestService$3.doRun(IngestService.java:383)
       app//org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:773)
       app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.base@13.0.1/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
       java.base@13.0.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
       java.base@13.0.1/java.lang.Thread.run(Thread.java:830)
   
   162.5% (812.6ms out of 500ms) cpu usage by thread 'elasticsearch[elk.abc.co.in][management][T#1]'
     5/10 snapshots sharing following 22 elements
       app//org.apache.lucene.codecs.compressing.CompressingStoredFieldsIndexReader.ramBytesUsed(CompressingStoredFieldsIndexReader.java:180)
       app//org.apache.lucene.codecs.compressing.CompressingStoredFieldsReader.ramBytesUsed(CompressingStoredFieldsReader.java:654)
       app//org.apache.lucene.index.CodecReader.ramBytesUsed(CodecReader.java:228)
       app//org.elasticsearch.index.engine.Engine.fillSegmentStats(Engine.java:837)
       app//org.elasticsearch.index.engine.Engine.segmentsStats(Engine.java:819)
       app//org.elasticsearch.index.shard.IndexShard.segmentStats(IndexShard.java:1043)
       app//org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:210)
     2/10 snapshots sharing following 24 elements
       java.base@13.0.1/java.util.TreeMap.getFirstEntry(TreeMap.java:2126)
       java.base@13.0.1/java.util.TreeMap$Values.iterator(TreeMap.java:1028)
       app//org.apache.lucene.codecs.blocktree.BlockTreeTermsReader.ramBytesUsed(BlockTreeTermsReader.java:331)
       app//org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.ramBytesUsed(PerFieldPostingsFormat.java:331)
     3/10 snapshots sharing following 19 elements
       app//org.elasticsearch.index.engine.Engine.fillSegmentStats(Engine.java:837)
       app//org.elasticsearch.index.engine.Engine.segmentsStats(Engine.java:819)
       app//org.elasticsearch.index.shard.IndexShard.segmentStats(IndexShard.java:1043)
       app//org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:210)
       app//org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(   
   154.5% (772.4ms out of 500ms) cpu usage by thread 'elasticsearch[elk.abc.co.in][write][T#23]'
     2/10 snapshots sharing following 11 elements
       java.base@13.0.1/java.util.HashMap.putAll(HashMap.java:785)
       app//org.elasticsearch.ingest.IngestDocument.<init>(IngestDocument.java:68)
       app//org.elasticsearch.ingest.IngestService.innerExecute(IngestService.java:503)
       app//org.elasticsearch.ingest.IngestService.executePipelines(IngestService.java:409)
       app//org.elasticsearch.ingest.IngestService.access$000(IngestService.java:75)
     2/10 snapshots sharing following 58 elements
       org.elasticsearch.ingest.common.RemoveProcessor$$Lambda$5963/0x00007eebb68e74b0.accept(Unknown Source)
       java.base@13.0.1/java.util.ArrayList.forEach(ArrayList.java:1507)
       org.elasticsearch.ingest.common.RemoveProcessor.execute(RemoveProcessor.java:57)
       app//org.elasticsearch.ingest.Processor.execute(Processor.java:50)
       app//org.elasticsearch.ingest.CompoundProcessor.innerExecute(CompoundProcesso
     2/10 snapshots sharing following 20 elements
       app//org.elasticsearch.common.xcontent.support.AbstractXContentParser.readMap(AbstractXContentParser.java:322)
       app//org.elasticsearch.common.xcontent.support.AbstractXContentParser.readValue(AbstractXContentParser.java:375)
       app//org.elasticsearch.common.xcontent.support.AbstractXContentParser.lambda$readMap$0(AbstractXContentParser.java:322)
       
     2/10 snapshots sharing following 62 elements
       app//org.elasticsearch.ingest.IngestService.lambda$innerExecute$4(IngestService.java:525)
       app//org.elasticsearch.ingest.IngestService$$Lambda$5959/0x00007eebb68ef840.accept(Unknown Source)
       app//org.elasticsearch.ingest.Pipeline.lambda$execute$0(Pipeline.java:106)
       app//org.elasticsearch.ingest.Pipeline$$Lambda$5960/0x00007eebb68ecd08.accept(Unknown Source)
       app//org.elasticsearch.ingest.CompoundProcessor.innerExecute(CompoundProcessor.java:128)
       
     2/10 snapshots sharing following 37 elements
       app//org.elasticsearch.ingest.CompoundProcessor.execute(CompoundProcessor.java:123)
       app//org.elasticsearch.ingest.Pipeline.execute(Pipeline.java:100)
       app//org.elasticsearch.ingest.IngestDocument.executePipeline(IngestDocument.java:651)
       app//org.elasticsearch.ingest.PipelineProcessor.execute(PipelineProcessor.java:43)
       app//org.elasticsearch.ingest.CompoundProcessor.innerExecute(CompoundProcessor.java:137)
       app//org.elasticsearch.ingest.CompoundProcessor.lambda$innerExecute$1(CompoundProcessor.java:156)
       app//org.elasticsearch.ingest.CompoundProcessor$$Lambda$5961/0x00007eebb68ea440.accept(Unknown Source)
       app//org.elasticsearch.ingest.IngestDocument.lambda$executePipeline$0(IngestDocument.java:653)
       app//org.elasticsearch.ingest.IngestDocument$$Lambda$5962/0x00007eebb68e7108.accept(Unknown Source)
       app//org.elasticsearch.ingest.Pipeline.lambda$execute$0(Pipeline.java:106)
       app//org.elasticsearch.ingest.Pipeline$$Lambda$5960/0x00007eebb68ecd08.accept(Unknown Source)
::: {elk.abc.co.in}{D9AO4ua7QQKowLbUe4-Ffw}{WlOQbHwMQGaDTQ1gELXQDw}{10.10.10.211}{10.10.10.211:9300}{dilm}{ml.machine_memory=67335020544, xpack.installed=true, ml.max_open_jobs=20}
   Hot threads at 2020-07-21T05:33:57.218Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
   
   141.5% (707.7ms out of 500ms) cpu usage by thread 'elasticsearch[elk.abc.co.in][http_server_worker][T#24]'
     7/10 snapshots sharing following 90 elements
       app//org.elasticsearch.common.xcontent.XContentHelper.convertToMap(XContentHelper.java:114)
       app//org.elasticsearch.common.xcontent.XContentHelper.convertToMap(XContentHelper.java:90)
       app//org.elasticsearch.cluster.metadata.MappingMetaData.sourceAsMap(MappingMetaData.java:149)
       app//org.elasticsearch.action.admin.indices.mapping.get.GetMappingsResponse.toXContent(GetMappingsResponse.java:135)
     3/10 snapshots sharing following 97 elements
       app//org.elasticsearch.common.xcontent.XContentBuilder.unknownValue(XContentBuilder.java:822)
       app//org.elasticsearch.common.xcontent.XContentBuilder.map(XContentBuilder.java:895)
       app//org.elasticsearch.common.xcontent.XContentBuilder.unknownValue(XContentBuilder.java:822)
       app//org.elasticsearch.common.xcontent.XContentBuilder.map(XContentBuilder.java:895)
       app//org.elasticsearch.common.xcontent.XContentBuilder.unknownValue(XContentBuilder.java:822)
       
   140.5% (702.7ms out of 500ms) cpu usage by thread 'elasticsearch[elk.abc.co.in][management][T#4]'
     2/10 snapshots sharing following 39 elements
       app//org.elasticsearch.bootstrap.ESPolicy.implies(ESPolicy.java:102)
       java.base@13.0.1/java.security.ProtectionDomain.implies(ProtectionDomain.java:321)
       java.base@13.0.1/java.security.ProtectionDomain.impliesWithAltFilePerm(ProtectionDomain.java:353)
       java.base@13.0.1/java.security.AccessControlContext.checkPermission(AccessControlContext.java:450)
       java.base@13.0.1/java.security.AccessController.checkPermission(AccessController.java:1036)
       java.base@13.0.1/java.lang.SecurityManager.checkPermission(SecurityManager.java:408)
       java.base@13.0.1/java.lang.SecurityManager.checkRead(SecurityManager.java:747)
       
     8/10 snapshots sharing following 15 elements
       app//org.elasticsearch.action.admin.cluster.stats.TransportClusterStatsAction.nodeOperation(TransportClusterStatsAction.java:120)
       app//org.elasticsearch.action.admin.cluster.stats.TransportClusterStatsAction.nodeOperation(TransportClusterStatsAction.java:52)
       app//org.elasticsearch.action.support.nodes.TransportNodesAction.nodeOperation(TransportNodesAction.java:129)
       app//org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:244)
       
   134.4% (672.2ms out of 500ms) cpu usage by thread 'elasticsearch[elk.abc.co.in][management][T#1]'
     3/10 snapshots sharing following 19 elements
       app//org.elasticsearch.index.engine.Engine.fillSegmentStats(Engine.java:837)
       app//org.elasticsearch.index.engine.Engine.segmentsStats(Engine.java:819)
       app//org.elasticsearch.index.shard.IndexShard.segmentStats(IndexShard.java:1043)
       app//org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:210)
       app//org.elasticsearch.action.admin.cluster.stats.TransportClusterStatsAction.nodeOperation(
     7/10 snapshots sharing following 15 elements
       app//org.elasticsearch.action.admin.cluster.stats.TransportClusterStatsAction.nodeOperation(TransportClusterStatsAction.java:120)
       app//org.elasticsearch.action.admin.cluster.stats.TransportClusterStatsAction.nodeOperation(TransportClusterStatsAction.java:52)
       app//org.elasticsearch.action.support.nodes.TransportNodesAction.nodeOperation(TransportNodesAction.java:129)
       app//org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:244)

How frequently is your snapshot scheduled to run? And how much data ingestion is happening per day?

By snapshot, I meant the hot_threads output, which I have taken manually at an interval of a few minutes.
Data ingestion is ~450M a day, spread across ~30 indices.
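(If that ~450M means roughly 450 million documents per day, it averages out to about 450,000,000 / 86,400 ≈ 5,200 documents per second.)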

Hi Team,

Any help here? I want to profile where Elasticsearch is spending its CPU.
I am not able to understand much from the hot_threads API output; is there any other way?
I tried enabling JMX via the jvm.options file, but Elasticsearch did not start with it. Is there an option to enable JMX monitoring?
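For context, what I tried in jvm.options was along these lines — standard JVM JMX flags rather than Elasticsearch settings, with the port and hostname as placeholders (not necessarily the exact values I used):

    # remote JMX monitoring; port and hostname are placeholders
    -Dcom.sun.management.jmxremote
    -Dcom.sun.management.jmxremote.port=7199
    -Dcom.sun.management.jmxremote.rmi.port=7199
    -Dcom.sun.management.jmxremote.authenticate=false
    -Dcom.sun.management.jmxremote.ssl=false
    -Djava.rmi.server.hostname=10.10.10.211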

Need quick help, please suggest.
