100% CPU across all 8 nodes during query

Hi everyone

We've been trying to get to the bottom of this issue for a few months now with no luck, and would really appreciate some support :slight_smile:

The issue

When querying our Wazuh Alert indexes we see 100% CPU usage across all cores on all DATA nodes in the cluster.

The query will often time out (our timeout period is 90 seconds) before any results are returned. We see query latencies of 7,000 to 10,000 milliseconds during this time.

Hot threads output:

It wouldn't let me paste all 250,000 characters, so see the full output at https://p.lee.io/2c890a25-42cc-402a-ae32-4af283726784#encryptionKey=GKRotxohBy84YEcAFYUWubLs

::: {MASTER-02}{Wf05XAPGTcatTslNp3vr8w}{FBeA4d4HS5C-EVpA9vkl-w}{192.168.63.17}{192.168.63.17:9300}{lm}{ml.machine_memory=16637591552, rack=MAN4, ml.max_open_jobs=20, xpack.installed=true}
   Hot threads at 2020-03-13T16:46:49.495Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

::: {DATA-06}{ZET2Ar1DRpOZzVtrLg2KBw}{S8tbmpVGT5m78zk8iey2-w}{192.168.63.41}{192.168.63.41:9300}{dil}{heat=hot, ml.machine_memory=33566724096, rack=MAN5, xpack.installed=true, ml.max_open_jobs=20}
   Hot threads at 2020-03-13T16:45:59.010Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
   
   93.7% (468.4ms out of 500ms) cpu usage by thread 'elasticsearch[DATA-06][search][T#2]'
     7/10 snapshots sharing following 46 elements
       java.base@11.0.6/java.nio.ByteBuffer.get(ByteBuffer.java:737)
       java.base@11.0.6/java.nio.DirectByteBuffer.get(DirectByteBuffer.java:319)
       app//org.apache.lucene.store.ByteBufferGuard.getBytes(ByteBufferGuard.java:94)
       app//org.apache.lucene.store.ByteBufferIndexInput.readBytes(ByteBufferIndexInput.java:99)
       app//org.apache.lucene.codecs.lucene80.Lucene80DocValuesProducer$TermsDict.next(Lucene80DocValuesProducer.java:972)
       app//org.apache.lucene.codecs.lucene80.Lucene80DocValuesProducer$TermsDict.seekExact(Lucene80DocValuesProducer.java:987)
       app//org.apache.lucene.codecs.lucene80.Lucene80DocValuesProducer$BaseSortedSetDocValues.lookupOrd(Lucene80DocValuesProducer.java:909)
       app//org.elasticsearch.index.fielddata.FieldData$10.nextValue(FieldData.java:367)
       app//org.elasticsearch.search.aggregations.bucket.terms.StringTermsAggregator$1.collect(StringTermsAggregator.java:92)
       app//org.elasticsearch.search.aggregations.LeafBucketCollector.collect(LeafBucketCollector.java:82)
       app//org.apache.lucene.search.MatchAllDocsQuery$1$1.score(MatchAllDocsQuery.java:64)
       app//org.elasticsearch.search.internal.CancellableBulkScorer.score(CancellableBulkScorer.java:56)
       app//org.apache.lucene.search.BulkScorer.score(BulkScorer.java:39)
       app//org.elasticsearch.search.internal.ContextIndexSearcher.searchLeaf(ContextIndexSearcher.java:198)
       app//org.elasticsearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:171)
       app//org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:445)
       app//org.elasticsearch.search.query.QueryPhase.searchWithCollector(QueryPhase.java:333)
       app//org.elasticsearch.search.query.QueryPhase.executeInternal(QueryPhase.java:295)
       app//org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:134)
       app//org.elasticsearch.indices.IndicesService.lambda$loadIntoContext$19(IndicesService.java:1335)
       app//org.elasticsearch.indices.IndicesService$$Lambda$5045/0x0000000840f89c40.accept(Unknown Source)
       app//org.elasticsearch.indices.IndicesService.lambda$cacheShardLevelResult$20(IndicesService.java:1392)
       app//org.elasticsearch.indices.IndicesService$$Lambda$5046/0x0000000840f8a040.get(Unknown Source)
       app//org.elasticsearch.indices.IndicesRequestCache$Loader.load(IndicesRequestCache.java:174)
       app//org.elasticsearch.indices.IndicesRequestCache$Loader.load(IndicesRequestCache.java:157)
       app//org.elasticsearch.common.cache.Cache.computeIfAbsent(Cache.java:433)
       app//org.elasticsearch.indices.IndicesRequestCache.getOrCompute(IndicesRequestCache.java:123)
       app//org.elasticsearch.indices.IndicesService.cacheShardLevelResult(IndicesService.java:1398)
       app//org.elasticsearch.indices.IndicesService.loadIntoContext(IndicesService.java:1332)
       app//org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:336)
       app//org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:358)
       app//org.elasticsearch.search.SearchService.lambda$executeQueryPhase$1(SearchService.java:343)
       app//org.elasticsearch.search.SearchService$$Lambda$4827/0x0000000840f13c40.apply(Unknown Source)
       app//org.elasticsearch.action.ActionListener.lambda$map$2(ActionListener.java:146)
       app//org.elasticsearch.action.ActionListener$$Lambda$4021/0x0000000840c65040.accept(Unknown Source)
       app//org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63)
       app//org.elasticsearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:58)
       app//org.elasticsearch.action.ActionRunnable$$Lambda$4839/0x0000000840f10840.accept(Unknown Source)
       app//org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:73)
       app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       app//org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:44)
       app//org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:692)
       app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.base@11.0.6/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
       java.base@11.0.6/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
       java.base@11.0.6/java.lang.Thread.run(Thread.java:834)
     3/10 snapshots sharing following 42 elements
       app//org.apache.lucene.codecs.lucene80.Lucene80DocValuesProducer$TermsDict.next(Lucene80DocValuesProducer.java:960)
       app//org.apache.lucene.codecs.lucene80.Lucene80DocValuesProducer$TermsDict.seekExact(Lucene80DocValuesProducer.java:987)
       app//org.apache.lucene.codecs.lucene80.Lucene80DocValuesProducer$BaseSortedSetDocValues.lookupOrd(Lucene80DocValuesProducer.java:909)
       app//org.elasticsearch.index.fielddata.FieldData$10.nextValue(FieldData.java:367)
       app//org.elasticsearch.search.aggregations.bucket.terms.StringTermsAggregator$1.collect(StringTermsAggregator.java:92)
       app//org.elasticsearch.search.aggregations.LeafBucketCollector.collect(LeafBucketCollector.java:82)
       app//org.apache.lucene.search.MatchAllDocsQuery$1$1.score(MatchAllDocsQuery.java:64)
       app//org.elasticsearch.search.internal.CancellableBulkScorer.score(CancellableBulkScorer.java:56)
       app//org.apache.lucene.search.BulkScorer.score(BulkScorer.java:39)
       app//org.elasticsearch.search.internal.ContextIndexSearcher.searchLeaf(ContextIndexSearcher.java:198)
       app//org.elasticsearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:171)
       app//org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:445)
       app//org.elasticsearch.search.query.QueryPhase.searchWithCollector(QueryPhase.java:333)
       app//org.elasticsearch.search.query.QueryPhase.executeInternal(QueryPhase.java:295)
       app//org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:134)
       app//org.elasticsearch.indices.IndicesService.lambda$loadIntoContext$19(IndicesService.java:1335)
       app//org.elasticsearch.indices.IndicesService$$Lambda$5045/0x0000000840f89c40.accept(Unknown Source)
       app//org.elasticsearch.indices.IndicesService.lambda$cacheShardLevelResult$20(IndicesService.java:1392)
       app//org.elasticsearch.indices.IndicesService$$Lambda$5046/0x0000000840f8a040.get(Unknown Source)
       app//org.elasticsearch.indices.IndicesRequestCache$Loader.load(IndicesRequestCache.java:174)
       app//org.elasticsearch.indices.IndicesRequestCache$Loader.load(IndicesRequestCache.java:157)
       app//org.elasticsearch.common.cache.Cache.computeIfAbsent(Cache.java:433)
       app//org.elasticsearch.indices.IndicesRequestCache.getOrCompute(IndicesRequestCache.java:123)
       app//org.elasticsearch.indices.IndicesService.cacheShardLevelResult(IndicesService.java:1398)
       app//org.elasticsearch.indices.IndicesService.loadIntoContext(IndicesService.java:1332)
       app//org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:336)
       app//org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:358)
       app//org.elasticsearch.search.SearchService.lambda$executeQueryPhase$1(SearchService.java:343)
       app//org.elasticsearch.search.SearchService$$Lambda$4827/0x0000000840f13c40.apply(Unknown Source)
       app//org.elasticsearch.action.ActionListener.lambda$map$2(ActionListener.java:146)
       app//org.elasticsearch.action.ActionListener$$Lambda$4021/0x0000000840c65040.accept(Unknown Source)
       app//org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63)
       app//org.elasticsearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:58)
       app//org.elasticsearch.action.ActionRunnable$$Lambda$4839/0x0000000840f10840.accept(Unknown Source)
       app//org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:73)
       app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       app//org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:44)
       app//org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:692)
       app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.base@11.0.6/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
       java.base@11.0.6/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
       java.base@11.0.6/java.lang.Thread.run(Thread.java:834)
   
   91.9% (459.6ms out of 500ms) cpu usage by thread 'elasticsearch[DATA-06][search][T#9]'
     8/10 snapshots sharing following 42 elements

We have PCIe NVMe disks for write cache, so we see index times of 0.3 to 0.7 milliseconds at an indexing rate of around 2,000 EPS (about 1,000 EPS of which is Wazuh alerts).

Background and cluster information

  • 12-node Elasticsearch cluster (8 DATA, 3 MASTER, 1 COORDINATING)
  • CentOS 7.7
  • VMware VMs spread across 6 hosts
  • DATA nodes have 8 vCPUs, 32GB memory, 2TB disk
  • Primarily stores Wazuh alerts
  • A new wazuh-alerts index is created every day, with 3 primaries and 1 replica. (Was 8 primaries until about 4 days ago; decreased due to lower data ingest as a result of other platform improvements.)
  • Each index (3 primaries) is about 12GB with about 10,000,000 documents.
  • Indexes get force merged after 1 day, shrunk after 90, frozen after 120.

Cluster health output

{
  "cluster_name" : "REDACTED",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 12,
  "number_of_data_nodes" : 8,
  "active_primary_shards" : 2186,
  "active_shards" : 3873,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}
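For scale, the shard counts above work out roughly as follows (back-of-envelope only, assuming shards are spread evenly across the data nodes):

```python
# Back-of-envelope shard math from the cluster health output above.
# Assumes shards are spread evenly across the 8 data nodes.
total_shards = 3873
data_nodes = 8
shards_per_node = total_shards / data_nodes
print(f"~{shards_per_node:.0f} shards per data node")

# Each daily wazuh-alerts index is ~12GB across 3 primaries,
# so an average primary shard is only ~4GB.
index_size_gb = 12
primaries = 3
print(f"~{index_size_gb / primaries:.0f} GB per primary shard")
```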

Things we've implemented to try and fix it

  • Moved to OpenJDK-11
  • Moved to G1GC garbage collector

G1GC Config:

-XX:-UseConcMarkSweepGC
-XX:+UseCondCardMark
-XX:MaxGCPauseMillis=200
-XX:+UseG1GC
-XX:GCPauseIntervalMillis=1000
-XX:InitiatingHeapOccupancyPercent=35

We would really appreciate some assistance with this. If you need any more information, please do let me know :slight_smile:

Thanks in advance.

@bradley.king you mention an NVMe write cache, but a write cache won't help with read performance. What kind of disk is the data being read from? HDD or SSD, direct or network attached?

Also, how many days back are you trying to query?

For comparison... I have a fairly complex schema that would probably be considered more complex than Wazuh. And I could launch a dashboard which requires 22 queries to render, over a date range that includes 49M records (about 5 days of your data), and it completes in 9.5 seconds. That is on a single 4 core box (bare metal) w/32GB memory and a single local NVMe SSD.

Rob

GitHub YouTube LinkedIn
How to install Elasticsearch & Kibana on Ubuntu
What is the best storage technology for Elasticsearch?


Hi Rob, thanks for getting back to me.

So we have a 60/40 (60% write, 40% read) cache on the SSDs, backed by 6 × 15K 4TB SAS drives in RAID 10.

These Elasticsearch VMs are running across a 6-host VMware cluster, each host with the above configuration.

Our first thought was disk IO; however, after looking into this further, we're not seeing much IO wait during the high CPU periods, either inside the VM or when VMware hits the disks :frowning:

Additionally, when monitoring inside the VM using Netdata during the high CPU periods, we barely see any disk usage at all, around 2 MBps, with 0% IO wait.

Would love to hear your thoughts on this :slight_smile:

There is actually a lot to this reply, so let's take it one step at a time...

I/O Wait

This is one of the most misunderstood CPU metrics. Most people assume this is a kind of CPU utilization, but it isn't. It is a kind of CPU idle time. It is basically saying... "this is the amount of idle time that is due to waiting on I/O". In the kernel accounting code the function is...

void account_idle_time(u64 cputime)
{
	u64 *cpustat = kcpustat_this_cpu->cpustat;
	struct rq *rq = this_rq();

	if (atomic_read(&rq->nr_iowait) > 0)
		cpustat[CPUTIME_IOWAIT] += cputime;
	else
		cpustat[CPUTIME_IDLE] += cputime;
}

Because I/O Wait is a kind of idle time, if other threads are available for execution they will get this available processor time. As a result, it is very possible to have threads which spend a significant amount of time waiting for I/O, yet still account for ZERO I/O wait, because the CPU is doing other work. So whenever you have very high CPU utilization, you cannot use I/O wait as a metric to determine storage performance.

Here is where this gets even more confusing. I/O wait is only recognized when a thread spends time in the kernel I/O scheduler. There are other calls which may be impacted by slow I/O which will NOT be accounted for in I/O Wait. Exactly how an application manifests its sensitivity to slow I/O in CPU metrics, is very implementation specific.

All of this means that just because you don't see high I/O Wait, it doesn't mean you aren't being impacted by slow storage.
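As a toy illustration of the accounting above (a hypothetical model, not kernel code): iowait only accrues on ticks where the CPU is otherwise idle, so busy threads hide it completely.

```python
# Toy model of account_idle_time(): each tick, the CPU is either busy
# or idle, and some number of tasks may be blocked on I/O.
# I/O wait is only charged on *idle* ticks -- a busy CPU hides it.
def account(ticks):
    busy = iowait = idle = 0
    for cpu_busy, nr_iowait in ticks:
        if cpu_busy:
            busy += 1            # time goes to user/system, never iowait
        elif nr_iowait > 0:
            iowait += 1          # idle AND tasks are waiting on I/O
        else:
            idle += 1
    return busy, iowait, idle

# Tasks are blocked on I/O every single tick, but other threads keep the
# CPU 100% busy -- so reported iowait is zero despite slow storage.
print(account([(True, 3)] * 10))   # (10, 0, 0)
```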

Spinning Disks

I won't spend a lot of time here because I recently created an entire video on the subject, which you can watch here... What is the best storage technology for Elasticsearch?

The summary is... spinning disks are a terrible idea for Elasticsearch. They may be OK for "warm nodes" in some scenarios, but are generally very bad.

RAID

It is also worth discussing your 6 disk RAID-10, which is really 3 disks when you remove the mirrors. In the past I have completed benchmarking where it is clear to see that anything other than multiples of 2 or 4 is less than optimal and odd numbers are almost useless for additional performance. 3 spindles will perform only marginally better than 2. However 4 spindles is nearly double that of 2. So your 6 disk RAID-10 might give you more capacity than a 4 disk RAID-10, but the performance is about the same. An 8 disk RAID-10 would provide a significant jump. BTW, this is true whether we are talking about HDDs or SSDs.
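As a rough sketch of that benchmarking observation (a hypothetical model of the behavior described, not measured data):

```python
# Hypothetical model of the RAID-10 benchmarking result described above:
# mirroring halves the disk count, and stripe widths that aren't powers
# of two perform only marginally better than the next power of two down.
def effective_spindles(disks):
    data = disks // 2                   # remove the mirror halves
    p = 1
    while p * 2 <= data:
        p *= 2
    return p                            # largest power of two <= data

for disks in (4, 6, 8, 16):
    print(f"{disks}-disk RAID-10 ~ {effective_spindles(disks)} effective spindles")
```

Under this model a 6-disk RAID-10 lands on the same ~2 effective spindles as a 4-disk array, while 8 disks doubles it.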

The above point aside... the real question is "Why RAID?" Elasticsearch provides redundancy via replicas. There is no need for additional redundancy at the hardware level. The rule is, configure storage for the best performance and let Elasticsearch take care of redundancy. (I understand there may be other applications running in this VMware environment for which you want RAID protection, but this is likely compromising your Elastic environment).

Storage Caches

The primary value of caches, especially a write cache, is to improve the performance of infrequent spikes in demand. However for sustained workloads you will always be limited by the underlying storage, regardless of the cache. This is because once the cache has filled, it can only accept new data at the rate that it can flush previous data.

You can also see this in things like TLC-based SSDs, which have a smaller SLC-based write cache. Short-duration writes perform really well and can be flushed to the slower storage over time. However, once a long-duration write fills the cache, it will block new writes until it can flush older data, which it can only do at the speed of the underlying slower storage.

If your 1K-2K EPS is enough to push your underlying HDD-based storage to its limits, which I suspect is the case, your write cache is largely irrelevant.
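The fill-rate argument is simple arithmetic; here is a sketch with illustrative numbers (the cache size and throughput figures are assumptions, not your measurements):

```python
# Time for a write cache to fill under sustained load: the cache only
# absorbs the *difference* between ingest rate and flush rate.
# All figures below are illustrative assumptions, not measured values.
def seconds_to_fill(cache_gb, ingest_mb_s, flush_mb_s):
    net_mb_s = ingest_mb_s - flush_mb_s
    if net_mb_s <= 0:
        return float("inf")             # backing disks keep up; never fills
    return cache_gb * 1024 / net_mb_s

# e.g. a 100GB cache, 20 MB/s sustained ingest, disks flushing 15 MB/s:
hours = seconds_to_fill(100, 20, 15) / 3600
print(f"~{hours:.1f} hours until the cache is full and writes slow down")
```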

You also mention a 60-40 split for writes and reads. This sounds like you are using the same SSDs for both. This is generally a bad practice as each interferes with the other's optimal performance. Since the sustained writes render the write cache largely useless, you may be better off just dedicating the SSDs to read cache. That said, depending on how this cache is populated, it might not be of much value. When Elasticsearch merges segments it writes new segments, i.e. new files, and thus the older small segments would be deleted and removed from the cache. Any query that hits these new segments would have to first reload the blocks into the read cache from disk. This background merging of segments, could hinder the read cache quite a bit.

Speaking of segment merges... what is the value of refresh_interval for these indices? A smaller interval results in more small segments being written, which must subsequently be merged. Increasing it could help a little.
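For reference, a sketch of what raising the interval looks like (the default is 1s; the wazuh-alerts-* index pattern is an assumption here, so adjust to your naming, and note new daily indices need this set via a template):

```
PUT /wazuh-alerts-*/_settings
{
  "index": {
    "refresh_interval": "30s"
  }
}
```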

Network vs. Local Storage

You didn't mention whether storage is local or network attached. My assumption is local, but just in case I will cover it.

Network attached is the worst case. Even if you have an all-SSD array, once you add the latency of interface cards/HBAs, network switches, etc. you have essentially turned those SSDs into a single spinning SATA HDD (from a latency perspective). Since latency has such a negative effect on IOPS, it is a serious limiter.

Page Cache

You mention that each VM has 32GB of memory. How much of that is dedicated to JVM heap? I realize that the Elastic recommendation is that half of RAM should be given to the JVM heap, but that is very use-case specific. If you have currently allocated 16GB, you might want to reduce this to 12GB (maybe even 8GB) to make more RAM available for page cache. This should give you a bit of a boost, as it helps avoid disk I/O for reads by holding more data in memory.
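The trade-off is straightforward arithmetic (ignoring OS and other process overhead):

```python
# How heap sizing trades against page cache on a 32GB node.
# Everything the JVM doesn't take is available for the OS to cache
# index data in, avoiding disk reads. (Ignores OS/process overhead.)
total_ram_gb = 32
for heap_gb in (16, 12, 8):
    page_cache_gb = total_ram_gb - heap_gb
    print(f"{heap_gb} GB heap -> ~{page_cache_gb} GB available for page cache")
```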

vCPUs

Keep in mind that vCPUs are not REAL CPU cores. Also it is important to understand that when VMware allocates CPU time to a VM, it will only do so if it has a "slot" available on the physical CPU for each vCPU. These slots are not "execution resources" (i.e. physical cores) they are "architectural state" (a fancy term for all of the registers which contain the details of the instructions to execute).

Hyper-threading provides 2 architectural states per core (execution resource). Keeping a second architectural state ready allows the CPU core to context switch between two threads more efficiently, rather than first loading the thread's state from memory. This efficiency improvement generally allows a CPU to do 10-15% more work overall.

Back to VMware. Given the number of data nodes and the number of VMware servers, it is clear that some servers must be running two data nodes (and presumably sharing the available storage IO - ouch!). Consider a CPU with 8 cores and hyper-threading, i.e. a CPU with 16 "architecture states". In theory VMware could allocate 16 vCPUs simultaneously. So it could execute work for both data nodes at the same time... or could it?

In reality there are other background processes that are using the CPU, such as executing the hypervisor kernel, logging, collecting metrics, serving the UI, etc. Since at any given time these processes are also occupying some of the available "architecture states", the two 8-vCPU VMs would only infrequently both get CPU resources at the same time. And even when they do, those vCPUs are not 8 real cores. The best performance expectation could only be that of 4 real cores plus 10-15%.

When evaluating CPU metrics from inside the VM, they are always in the context of how much CPU time (how many "ticks") the hypervisor allocated to the VM. If you run top in your VM and find it has 50% CPU utilization, that means it is utilizing 50% of the time the hypervisor has given it, not 50% of the underlying CPU resources. In the event that the hypervisor can only allocate very few resources to the VM, it may show 100% utilization (it used all it was given) while actually doing very little real work. So the question is... are your VMs showing 100% utilization because of the high volume of work they are doing, or because of the limited CPU time they are allocated?

For this reason you might actually get better performance by reducing the number of vCPUs. This is because you increase the chance that the VM will be allocated slots by the scheduler. Ideally CPU bound applications should never run on a hypervisor where the sum of vCPUs across all VMs exceeds the number of REAL cores available.
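A toy model of the co-scheduling problem (the slot counts and hypervisor overhead below are assumptions for illustration):

```python
# Co-scheduling pressure: a VM only runs when the hypervisor can find a
# free "slot" (architectural state) for every one of its vCPUs at once.
# Illustrative: 8 physical cores with hyper-threading = 16 slots.
slots = 16
vm_vcpus = [8, 8]                 # two 8-vCPU data nodes on one host
hypervisor_overhead = 2           # assumed slots busy with host processes
free = slots - hypervisor_overhead
both_fit = sum(vm_vcpus) <= free
print(f"free slots: {free}, both VMs schedulable at once: {both_fit}")
```

With fewer vCPUs per VM (say 6 each), both VMs would fit in the free slots simultaneously, which is why reducing vCPU count can improve scheduling.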

Finally, imagine now that your vCPUs have finally been allocated some execution time. They ask for some data from disk, and due to slow I/O... BAM! They have to wait, are kicked off the CPU and have to wait in line again to be reallocated. It is a death spiral.

Recommendation

Given your current hardware, perhaps you could move all of the non-Elastic related VMs onto three of your VMware servers. Then re-deploy the other three servers as a simple three node bare-metal cluster, which would literally run circles around what you currently have.

Rob


Hi Rob,

Thank you so much for your incredibly detailed reply, this is really useful :slight_smile:

We're going to look at building a complete hardware cluster using all-SSD storage based on your feedback.

Thanks again.

Brad