I am having the below setup
Elasticsearch version : 7.5.1
APM Server version : 7.5.1
APM Agent language and version : elasticapm-java/1.12.0
Its a single node cluster, with about 500 shards, with 1 replica
64GB RAM, 32 processor, 3TB storage(55% occupied)
Workload : Inserting documents from APM Server & Logstash & few dashboards for visualisation.
I am observing that the CPU is constantly between 90 to 100%, as a result lot of events are getting lost due to 'queue is full' error. Even basic index list apis take long time to respond.
Until 2 weeks back, things were working fine, with about 80% of current volume.
The GET _nodes/hot_threads API gives below results...I have taken multiple snapshots, & I see the management & write threads coming alternatively.
::: {elk.abc.co.in}{D9AO4ua7QQKowLbUe4-Ffw}{WlOQbHwMQGaDTQ1gELXQDw}{10.10.10.211}{10.10.10.211:9300}{dilm}{ml.machine_memory=67335020544, xpack.installed=true, ml.max_open_jobs=20}
Hot threads at 2020-07-21T05:43:32.529Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
181.7% (908.7ms out of 500ms) cpu usage by thread 'elasticsearch[elk.abc.co.in][write][T#6]'
4/10 snapshots sharing following 54 elements
app//org.elasticsearch.ingest.CompoundProcessor.innerExecute(CompoundProcessor.java:137)
app//org.elasticsearch.ingest.CompoundProcessor.lambda$innerExecute$1(CompoundProcessor.java:156)
app//org.elasticsearch.ingest.CompoundProcessor$$Lambda$5961/0x00007eebb68ea440.accept(Unknown Source)
app//org.elasticsearch.ingest.IngestDocument.lambda$executePipeline$0(IngestDocument.java:653)
app//org.elasticsearch.ingest.IngestDocument$$Lambda$5962/0x00007eebb68e7108.accept(Unknown Source)
app//org.elasticsearch.ingest.Pipeline.lambda$execute$0(Pipeline.java:106)
app//org.elasticsearch.ingest.Pipeline$$Lambda$5960/0x00007eebb68ecd08.accept(Unknown Source)
app//org.elasticsearch.ingest.CompoundProcessor.innerExecute(CompoundProcessor.java:128)
6/10 snapshots sharing following 8 elements
app//org.elasticsearch.ingest.IngestService.executePipelines(IngestService.java:409)
app//org.elasticsearch.ingest.IngestService.access$000(IngestService.java:75)
app//org.elasticsearch.ingest.IngestService$3.doRun(IngestService.java:383)
app//org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:773)
app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
java.base@13.0.1/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
java.base@13.0.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
java.base@13.0.1/java.lang.Thread.run(Thread.java:830)
162.5% (812.6ms out of 500ms) cpu usage by thread 'elasticsearch[elk.abc.co.in][management][T#1]'
5/10 snapshots sharing following 22 elements
app//org.apache.lucene.codecs.compressing.CompressingStoredFieldsIndexReader.ramBytesUsed(CompressingStoredFieldsIndexReader.java:180)
app//org.apache.lucene.codecs.compressing.CompressingStoredFieldsReader.ramBytesUsed(CompressingStoredFieldsReader.java:654)
app//org.apache.lucene.index.CodecReader.ramBytesUsed(CodecReader.java:228)
app//org.elasticsearch.index.engine.Engine.fillSegmentStats(Engine.java:837)
app//org.elasticsearch.index.engine.Engine.segmentsStats(Engine.java:819)
app//org.elasticsearch.index.shard.IndexShard.segmentStats(IndexShard.java:1043)
app//org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:210)
2/10 snapshots sharing following 24 elements
java.base@13.0.1/java.util.TreeMap.getFirstEntry(TreeMap.java:2126)
java.base@13.0.1/java.util.TreeMap$Values.iterator(TreeMap.java:1028)
app//org.apache.lucene.codecs.blocktree.BlockTreeTermsReader.ramBytesUsed(BlockTreeTermsReader.java:331)
app//org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.ramBytesUsed(PerFieldPostingsFormat.java:331)
3/10 snapshots sharing following 19 elements
app//org.elasticsearch.index.engine.Engine.fillSegmentStats(Engine.java:837)
app//org.elasticsearch.index.engine.Engine.segmentsStats(Engine.java:819)
app//org.elasticsearch.index.shard.IndexShard.segmentStats(IndexShard.java:1043)
app//org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:210)
app//org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(
154.5% (772.4ms out of 500ms) cpu usage by thread 'elasticsearch[elk.abc.co.in][write][T#23]'
2/10 snapshots sharing following 11 elements
java.base@13.0.1/java.util.HashMap.putAll(HashMap.java:785)
app//org.elasticsearch.ingest.IngestDocument.<init>(IngestDocument.java:68)
app//org.elasticsearch.ingest.IngestService.innerExecute(IngestService.java:503)
app//org.elasticsearch.ingest.IngestService.executePipelines(IngestService.java:409)
app//org.elasticsearch.ingest.IngestService.access$000(IngestService.java:75)
2/10 snapshots sharing following 58 elements
org.elasticsearch.ingest.common.RemoveProcessor$$Lambda$5963/0x00007eebb68e74b0.accept(Unknown Source)
java.base@13.0.1/java.util.ArrayList.forEach(ArrayList.java:1507)
org.elasticsearch.ingest.common.RemoveProcessor.execute(RemoveProcessor.java:57)
app//org.elasticsearch.ingest.Processor.execute(Processor.java:50)
app//org.elasticsearch.ingest.CompoundProcessor.innerExecute(CompoundProcesso
2/10 snapshots sharing following 20 elements
app//org.elasticsearch.common.xcontent.support.AbstractXContentParser.readMap(AbstractXContentParser.java:322)
app//org.elasticsearch.common.xcontent.support.AbstractXContentParser.readValue(AbstractXContentParser.java:375)
app//org.elasticsearch.common.xcontent.support.AbstractXContentParser.lambda$readMap$0(AbstractXContentParser.java:322)
2/10 snapshots sharing following 62 elements
app//org.elasticsearch.ingest.IngestService.lambda$innerExecute$4(IngestService.java:525)
app//org.elasticsearch.ingest.IngestService$$Lambda$5959/0x00007eebb68ef840.accept(Unknown Source)
app//org.elasticsearch.ingest.Pipeline.lambda$execute$0(Pipeline.java:106)
app//org.elasticsearch.ingest.Pipeline$$Lambda$5960/0x00007eebb68ecd08.accept(Unknown Source)
app//org.elasticsearch.ingest.CompoundProcessor.innerExecute(CompoundProcessor.java:128)
2/10 snapshots sharing following 37 elements
app//org.elasticsearch.ingest.CompoundProcessor.execute(CompoundProcessor.java:123)
app//org.elasticsearch.ingest.Pipeline.execute(Pipeline.java:100)
app//org.elasticsearch.ingest.IngestDocument.executePipeline(IngestDocument.java:651)
app//org.elasticsearch.ingest.PipelineProcessor.execute(PipelineProcessor.java:43)
app//org.elasticsearch.ingest.CompoundProcessor.innerExecute(CompoundProcessor.java:137)
app//org.elasticsearch.ingest.CompoundProcessor.lambda$innerExecute$1(CompoundProcessor.java:156)
app//org.elasticsearch.ingest.CompoundProcessor$$Lambda$5961/0x00007eebb68ea440.accept(Unknown Source)
app//org.elasticsearch.ingest.IngestDocument.lambda$executePipeline$0(IngestDocument.java:653)
app//org.elasticsearch.ingest.IngestDocument$$Lambda$5962/0x00007eebb68e7108.accept(Unknown Source)
app//org.elasticsearch.ingest.Pipeline.lambda$execute$0(Pipeline.java:106)
app//org.elasticsearch.ingest.Pipeline$$Lambda$5960/0x00007eebb68ecd08.accept(Unknown Source)
::: {elk.abc.co.in}{D9AO4ua7QQKowLbUe4-Ffw}{WlOQbHwMQGaDTQ1gELXQDw}{10.10.10.211}{10.10.10.211:9300}{dilm}{ml.machine_memory=67335020544, xpack.installed=true, ml.max_open_jobs=20}
Hot threads at 2020-07-21T05:33:57.218Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
141.5% (707.7ms out of 500ms) cpu usage by thread 'elasticsearch[elk.abc.co.in][http_server_worker][T#24]'
7/10 snapshots sharing following 90 elements
app//org.elasticsearch.common.xcontent.XContentHelper.convertToMap(XContentHelper.java:114)
app//org.elasticsearch.common.xcontent.XContentHelper.convertToMap(XContentHelper.java:90)
app//org.elasticsearch.cluster.metadata.MappingMetaData.sourceAsMap(MappingMetaData.java:149)
app//org.elasticsearch.action.admin.indices.mapping.get.GetMappingsResponse.toXContent(GetMappingsResponse.java:135)
3/10 snapshots sharing following 97 elements
app//org.elasticsearch.common.xcontent.XContentBuilder.unknownValue(XContentBuilder.java:822)
app//org.elasticsearch.common.xcontent.XContentBuilder.map(XContentBuilder.java:895)
app//org.elasticsearch.common.xcontent.XContentBuilder.unknownValue(XContentBuilder.java:822)
app//org.elasticsearch.common.xcontent.XContentBuilder.map(XContentBuilder.java:895)
app//org.elasticsearch.common.xcontent.XContentBuilder.unknownValue(XContentBuilder.java:822)
140.5% (702.7ms out of 500ms) cpu usage by thread 'elasticsearch[elk.abc.co.in][management][T#4]'
2/10 snapshots sharing following 39 elements
app//org.elasticsearch.bootstrap.ESPolicy.implies(ESPolicy.java:102)
java.base@13.0.1/java.security.ProtectionDomain.implies(ProtectionDomain.java:321)
java.base@13.0.1/java.security.ProtectionDomain.impliesWithAltFilePerm(ProtectionDomain.java:353)
java.base@13.0.1/java.security.AccessControlContext.checkPermission(AccessControlContext.java:450)
java.base@13.0.1/java.security.AccessController.checkPermission(AccessController.java:1036)
java.base@13.0.1/java.lang.SecurityManager.checkPermission(SecurityManager.java:408)
java.base@13.0.1/java.lang.SecurityManager.checkRead(SecurityManager.java:747)
8/10 snapshots sharing following 15 elements
app//org.elasticsearch.action.admin.cluster.stats.TransportClusterStatsAction.nodeOperation(TransportClusterStatsAction.java:120)
app//org.elasticsearch.action.admin.cluster.stats.TransportClusterStatsAction.nodeOperation(TransportClusterStatsAction.java:52)
app//org.elasticsearch.action.support.nodes.TransportNodesAction.nodeOperation(TransportNodesAction.java:129)
app//org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:244)
134.4% (672.2ms out of 500ms) cpu usage by thread 'elasticsearch[elk.abc.co.in][management][T#1]'
3/10 snapshots sharing following 19 elements
app//org.elasticsearch.index.engine.Engine.fillSegmentStats(Engine.java:837)
app//org.elasticsearch.index.engine.Engine.segmentsStats(Engine.java:819)
app//org.elasticsearch.index.shard.IndexShard.segmentStats(IndexShard.java:1043)
app//org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:210)
app//org.elasticsearch.action.admin.cluster.stats.TransportClusterStatsAction.nodeOperation(
7/10 snapshots sharing following 15 elements
app//org.elasticsearch.action.admin.cluster.stats.TransportClusterStatsAction.nodeOperation(TransportClusterStatsAction.java:120)
app//org.elasticsearch.action.admin.cluster.stats.TransportClusterStatsAction.nodeOperation(TransportClusterStatsAction.java:52)
app//org.elasticsearch.action.support.nodes.TransportNodesAction.nodeOperation(TransportNodesAction.java:129)
app//org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:244)