Hi everyone,
We've been trying to get to the bottom of this issue for a few months now with no luck, and would really appreciate some support.
The issue
When querying our Wazuh alert indexes we see 100% CPU usage across all cores on all DATA nodes in the cluster.
The query will often time out (our timeout period is 90 seconds) before any results are returned. We see query latencies of 7,000 to 10,000 milliseconds during this time.
Hot threads output:
It wouldn't let me paste all 250,000 characters, so see the full output at https://p.lee.io/2c890a25-42cc-402a-ae32-4af283726784#encryptionKey=GKRotxohBy84YEcAFYUWubLs
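(For reference, the excerpt below is from the nodes hot threads API; the interval and thread count shown in the headers match a request along these lines:)

GET _nodes/hot_threads?threads=3&interval=500ms&ignore_idle_threads=true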
::: {MASTER-02}{Wf05XAPGTcatTslNp3vr8w}{FBeA4d4HS5C-EVpA9vkl-w}{192.168.63.17}{192.168.63.17:9300}{lm}{ml.machine_memory=16637591552, rack=MAN4, ml.max_open_jobs=20, xpack.installed=true}
Hot threads at 2020-03-13T16:46:49.495Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
::: {DATA-06}{ZET2Ar1DRpOZzVtrLg2KBw}{S8tbmpVGT5m78zk8iey2-w}{192.168.63.41}{192.168.63.41:9300}{dil}{heat=hot, ml.machine_memory=33566724096, rack=MAN5, xpack.installed=true, ml.max_open_jobs=20}
Hot threads at 2020-03-13T16:45:59.010Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
93.7% (468.4ms out of 500ms) cpu usage by thread 'elasticsearch[DATA-06][search][T#2]'
7/10 snapshots sharing following 46 elements
java.base@11.0.6/java.nio.ByteBuffer.get(ByteBuffer.java:737)
java.base@11.0.6/java.nio.DirectByteBuffer.get(DirectByteBuffer.java:319)
app//org.apache.lucene.store.ByteBufferGuard.getBytes(ByteBufferGuard.java:94)
app//org.apache.lucene.store.ByteBufferIndexInput.readBytes(ByteBufferIndexInput.java:99)
app//org.apache.lucene.codecs.lucene80.Lucene80DocValuesProducer$TermsDict.next(Lucene80DocValuesProducer.java:972)
app//org.apache.lucene.codecs.lucene80.Lucene80DocValuesProducer$TermsDict.seekExact(Lucene80DocValuesProducer.java:987)
app//org.apache.lucene.codecs.lucene80.Lucene80DocValuesProducer$BaseSortedSetDocValues.lookupOrd(Lucene80DocValuesProducer.java:909)
app//org.elasticsearch.index.fielddata.FieldData$10.nextValue(FieldData.java:367)
app//org.elasticsearch.search.aggregations.bucket.terms.StringTermsAggregator$1.collect(StringTermsAggregator.java:92)
app//org.elasticsearch.search.aggregations.LeafBucketCollector.collect(LeafBucketCollector.java:82)
app//org.apache.lucene.search.MatchAllDocsQuery$1$1.score(MatchAllDocsQuery.java:64)
app//org.elasticsearch.search.internal.CancellableBulkScorer.score(CancellableBulkScorer.java:56)
app//org.apache.lucene.search.BulkScorer.score(BulkScorer.java:39)
app//org.elasticsearch.search.internal.ContextIndexSearcher.searchLeaf(ContextIndexSearcher.java:198)
app//org.elasticsearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:171)
app//org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:445)
app//org.elasticsearch.search.query.QueryPhase.searchWithCollector(QueryPhase.java:333)
app//org.elasticsearch.search.query.QueryPhase.executeInternal(QueryPhase.java:295)
app//org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:134)
app//org.elasticsearch.indices.IndicesService.lambda$loadIntoContext$19(IndicesService.java:1335)
app//org.elasticsearch.indices.IndicesService$$Lambda$5045/0x0000000840f89c40.accept(Unknown Source)
app//org.elasticsearch.indices.IndicesService.lambda$cacheShardLevelResult$20(IndicesService.java:1392)
app//org.elasticsearch.indices.IndicesService$$Lambda$5046/0x0000000840f8a040.get(Unknown Source)
app//org.elasticsearch.indices.IndicesRequestCache$Loader.load(IndicesRequestCache.java:174)
app//org.elasticsearch.indices.IndicesRequestCache$Loader.load(IndicesRequestCache.java:157)
app//org.elasticsearch.common.cache.Cache.computeIfAbsent(Cache.java:433)
app//org.elasticsearch.indices.IndicesRequestCache.getOrCompute(IndicesRequestCache.java:123)
app//org.elasticsearch.indices.IndicesService.cacheShardLevelResult(IndicesService.java:1398)
app//org.elasticsearch.indices.IndicesService.loadIntoContext(IndicesService.java:1332)
app//org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:336)
app//org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:358)
app//org.elasticsearch.search.SearchService.lambda$executeQueryPhase$1(SearchService.java:343)
app//org.elasticsearch.search.SearchService$$Lambda$4827/0x0000000840f13c40.apply(Unknown Source)
app//org.elasticsearch.action.ActionListener.lambda$map$2(ActionListener.java:146)
app//org.elasticsearch.action.ActionListener$$Lambda$4021/0x0000000840c65040.accept(Unknown Source)
app//org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63)
app//org.elasticsearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:58)
app//org.elasticsearch.action.ActionRunnable$$Lambda$4839/0x0000000840f10840.accept(Unknown Source)
app//org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:73)
app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
app//org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:44)
app//org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:692)
app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
java.base@11.0.6/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
java.base@11.0.6/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
java.base@11.0.6/java.lang.Thread.run(Thread.java:834)
3/10 snapshots sharing following 42 elements
app//org.apache.lucene.codecs.lucene80.Lucene80DocValuesProducer$TermsDict.next(Lucene80DocValuesProducer.java:960)
app//org.apache.lucene.codecs.lucene80.Lucene80DocValuesProducer$TermsDict.seekExact(Lucene80DocValuesProducer.java:987)
app//org.apache.lucene.codecs.lucene80.Lucene80DocValuesProducer$BaseSortedSetDocValues.lookupOrd(Lucene80DocValuesProducer.java:909)
app//org.elasticsearch.index.fielddata.FieldData$10.nextValue(FieldData.java:367)
app//org.elasticsearch.search.aggregations.bucket.terms.StringTermsAggregator$1.collect(StringTermsAggregator.java:92)
app//org.elasticsearch.search.aggregations.LeafBucketCollector.collect(LeafBucketCollector.java:82)
app//org.apache.lucene.search.MatchAllDocsQuery$1$1.score(MatchAllDocsQuery.java:64)
app//org.elasticsearch.search.internal.CancellableBulkScorer.score(CancellableBulkScorer.java:56)
app//org.apache.lucene.search.BulkScorer.score(BulkScorer.java:39)
app//org.elasticsearch.search.internal.ContextIndexSearcher.searchLeaf(ContextIndexSearcher.java:198)
app//org.elasticsearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:171)
app//org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:445)
app//org.elasticsearch.search.query.QueryPhase.searchWithCollector(QueryPhase.java:333)
app//org.elasticsearch.search.query.QueryPhase.executeInternal(QueryPhase.java:295)
app//org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:134)
app//org.elasticsearch.indices.IndicesService.lambda$loadIntoContext$19(IndicesService.java:1335)
app//org.elasticsearch.indices.IndicesService$$Lambda$5045/0x0000000840f89c40.accept(Unknown Source)
app//org.elasticsearch.indices.IndicesService.lambda$cacheShardLevelResult$20(IndicesService.java:1392)
app//org.elasticsearch.indices.IndicesService$$Lambda$5046/0x0000000840f8a040.get(Unknown Source)
app//org.elasticsearch.indices.IndicesRequestCache$Loader.load(IndicesRequestCache.java:174)
app//org.elasticsearch.indices.IndicesRequestCache$Loader.load(IndicesRequestCache.java:157)
app//org.elasticsearch.common.cache.Cache.computeIfAbsent(Cache.java:433)
app//org.elasticsearch.indices.IndicesRequestCache.getOrCompute(IndicesRequestCache.java:123)
app//org.elasticsearch.indices.IndicesService.cacheShardLevelResult(IndicesService.java:1398)
app//org.elasticsearch.indices.IndicesService.loadIntoContext(IndicesService.java:1332)
app//org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:336)
app//org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:358)
app//org.elasticsearch.search.SearchService.lambda$executeQueryPhase$1(SearchService.java:343)
app//org.elasticsearch.search.SearchService$$Lambda$4827/0x0000000840f13c40.apply(Unknown Source)
app//org.elasticsearch.action.ActionListener.lambda$map$2(ActionListener.java:146)
app//org.elasticsearch.action.ActionListener$$Lambda$4021/0x0000000840c65040.accept(Unknown Source)
app//org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63)
app//org.elasticsearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:58)
app//org.elasticsearch.action.ActionRunnable$$Lambda$4839/0x0000000840f10840.accept(Unknown Source)
app//org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:73)
app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
app//org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:44)
app//org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:692)
app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
java.base@11.0.6/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
java.base@11.0.6/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
java.base@11.0.6/java.lang.Thread.run(Thread.java:834)
91.9% (459.6ms out of 500ms) cpu usage by thread 'elasticsearch[DATA-06][search][T#9]'
8/10 snapshots sharing following 42 elements
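For context on what the busy threads are doing above: the hot frames sit in StringTermsAggregator and Lucene80DocValuesProducer$TermsDict.lookupOrd under a MatchAllDocsQuery, i.e. a terms aggregation on a keyword field with no narrowing query. A request of roughly this shape exercises that path (the field and aggregation names here are illustrative, not our exact dashboard query):

GET wazuh-alerts-*/_search
{
  "size": 0,
  "query": { "match_all": {} },
  "aggs": {
    "top_values": {
      "terms": { "field": "rule.description", "size": 10 }
    }
  }
}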
We have PCIe NVMe disks for write cache, so we see index times of 0.3 to 0.7 milliseconds at an indexing rate of around 2,000 EPS (about 1,000 EPS of that is the Wazuh alerts).
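(For anyone who wants to compare, those per-document index times can also be read off the index stats API, e.g. index_time_in_millis divided by index_total from something like:)

GET wazuh-alerts-*/_stats/indexing?filter_path=_all.primaries.indexing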
Background and cluster information
- 12-node Elasticsearch cluster (8 DATA, 3 MASTER, 1 COORDINATING)
- CentOS 7.7
- VMware VMs spread across 6 hosts
- DATA nodes have 8 vCPUs, 32 GB memory and 2 TB disk
- Primarily stores Wazuh alerts
- A new wazuh-alerts index is created every day, with 3 primaries and 1 replica (was 8 primaries until about 4 days ago; reduced because other platform improvements lowered our data ingest)
- Each index (the 3-primary ones) is about 12 GB with around 10,000,000 documents.
- Indexes get force merged after 1 day, shrunk after 90 and frozen after 120 (see the sketch of those steps below).
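For reference, the per-index operations behind that lifecycle are the standard force merge, shrink and freeze APIs. A rough sketch with illustrative index names (shrink additionally needs the source index set read-only and its shards relocated onto a single node first):

POST wazuh-alerts-2020.03.12/_forcemerge?max_num_segments=1
POST wazuh-alerts-2019.12.13/_shrink/wazuh-alerts-2019.12.13-shrunk
POST wazuh-alerts-2019.11.13/_freeze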
Cluster health output
{
"cluster_name" : "REDACTED",
"status" : "green",
"timed_out" : false,
"number_of_nodes" : 12,
"number_of_data_nodes" : 8,
"active_primary_shards" : 2186,
"active_shards" : 3873,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 100.0
}
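(3,873 active shards across 8 data nodes works out to roughly 480 shards per data node; if it helps, the per-node spread can be checked with the cat allocation API, e.g.:)

GET _cat/allocation?v&h=node,shards,disk.used,disk.percent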
Things we've implemented to try and fix it
- Moved to OpenJDK-11
- Moved to G1GC garbage collector
G1GC Config:
-XX:-UseConcMarkSweepGC
-XX:+UseCondCardMark
-XX:MaxGCPauseMillis=200
-XX:+UseG1GC
-XX:GCPauseIntervalMillis=1000
-XX:InitiatingHeapOccupancyPercent=35
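(These flags go in jvm.options on each node; if useful, the arguments a node actually started with can be confirmed via the nodes info API, e.g.:)

GET _nodes/jvm?filter_path=nodes.*.jvm.input_arguments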
We would really appreciate some assistance with this; if you need any more information, please let me know.
Thanks in advance.