CPU utilization of the whole cluster suddenly spikes to 100%

We have a 4-node (all data nodes) m4.2xlarge cluster on AWS, with around 15 million docs spread across 4 indices.
It's the standard configuration of 5 primaries and 1 replica; one of the indices has 2 replicas.

Normal CPU usage is around 10-15%, but it suddenly jumps to 100% and stays there for 10-15 minutes, and it's the same on all nodes. There are no spikes in traffic to ES (based on Packetbeat data). The request queue builds up during this period and a lot of requests get rejected; very few requests succeed.

When we run 6 nodes, the spike only reaches around 60-70% CPU usage.


Heap usage seems normal.

Could this be because of segment merging? Maybe the merge of a large segment?
The largest segment is around 750 MB, which I don't think should be a problem, and there is no spike in disk I/O either. There is no correlation between CPU usage and segment merging based on the Marvel dashboards.
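For anyone who wants to reproduce the check, something like this against the `_cat/segments` API should list per-shard segment sizes (just a sketch, assuming a node is reachable on localhost:9200):

```python
import requests

ES = "http://localhost:9200"  # assumption: a data node reachable locally

# _cat/segments lists every Lucene segment per shard with its size on disk;
# "v" adds a header row to the plain-text output.
resp = requests.get(ES + "/_cat/segments", params={"v": "true"})
resp.raise_for_status()
print(resp.text)
```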

Could this be because of GC? Again, I don't think this is the reason for the CPU spikes. Old GCs are happening frequently, but they occur throughout the day. I hear frequent old GC is bad, but I am not sure it explains the CPU spikes.


A spike in the young GC count could be a symptom of something else.

We don't have much field data, but we do have high filter cache usage and a high eviction rate. We are working on optimizing the queries and disabling the cache for some fields.
The filter cache limit is set to 30% of the heap.
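Something along these lines can be used to watch the filter cache and field data footprint per node via the node stats API (a rough sketch; the field names are my recollection of the 1.x node stats response, so treat them as assumptions):

```python
import requests

ES = "http://localhost:9200"  # assumption: any node in the cluster

stats = requests.get(ES + "/_nodes/stats/indices").json()
for node_id, node in stats["nodes"].items():
    indices = node["indices"]
    fc = indices.get("filter_cache", {})   # filter cache size and evictions
    fd = indices.get("fielddata", {})      # field data size
    print(node.get("name", node_id),
          "| filter_cache:", fc.get("memory_size_in_bytes"), "bytes,",
          fc.get("evictions"), "evictions",
          "| fielddata:", fd.get("memory_size_in_bytes"), "bytes")
```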

We do have hourly snapshots being taken to S3.

At this point we are clueless about how to debug this. Is it possible that this has nothing to do with ES at all? We don't have anything else running on these systems other than Packetbeat and the CloudWatch monitoring scripts. We also ship Marvel stats to a different ES cluster.

My first guess is old GC, based on the graphs and the fact that requests are being rejected. I don't think it is segment merging per se (see caveat note below), since that tends to be "friendlier" and has throttling in place.

Your heaps are running right at 75%, which is the threshold Elasticsearch configures for old-gen GCs to start running (it sets CMSInitiatingOccupancyFraction=75). I.e. the JVM will avoid doing old-gen GCs until it reaches 75% heap usage, then it'll kick off an old-gen GC to clean the heap.

If the heap is full of actively used objects, old-gen GCs can be very expensive in terms of CPU, because the collector has to start shuffling objects around the heap, compacting regions, etc. It also takes a long time. My guess is that you're seeing frequent, smaller old-gen GCs because the JVM is only requesting a small amount of memory... but every once in a while it requests a larger block, the GC struggles to provide the memory, and you get huge CPU spikes.
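A quick way to check how close each node is sitting to that 75% trigger is the JVM section of node stats; a minimal sketch, assuming a node on localhost:9200:

```python
import requests

ES = "http://localhost:9200"  # assumption: any node in the cluster

# heap_used_percent shows how close each node is to the CMS trigger point.
stats = requests.get(ES + "/_nodes/stats/jvm").json()
for node_id, node in stats["nodes"].items():
    mem = node["jvm"]["mem"]
    print("%-25s heap used: %s%%" % (node.get("name", node_id),
                                     mem["heap_used_percent"]))
```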

  • What version of ES are you using?

  • Can you paste some of the old-gen GCs from your logs? They should show up as warnings, listing a duration plus how much memory was reclaimed.

  • What does your field-data usage look like?

  • Are you using nested docs, parent/child?

  • Do you happen to have a large number of fields and/or types in a single index? There is a known problem with how Lucene 4.x (used in ES 1.x) merges segments: when merging, it loads the norms for all fields in the segment into memory for the merge. If you have many, many fields this can eat up a lot of transient memory during a merge, and could trigger some heavy GCs.

  • Could you show the segment merging chart? Would be interesting to see if that lines up with the CPU usage (related to previous point)

  • If possible, you could try to collect hot threads output during one of the spikes, to get a sense of which threads are burning CPU cycles (a small polling sketch follows this list).
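Something like the sketch below could be left running so you catch the next spike as it happens; the thread count, interval, and file names are arbitrary, and it assumes a node is reachable on localhost:9200:

```python
import datetime
import time
import requests

ES = "http://localhost:9200"  # assumption: any node in the cluster

while True:
    # hot_threads returns plain text: the busiest threads on every node,
    # sampled over a short interval; very useful during a CPU spike.
    resp = requests.get(ES + "/_nodes/hot_threads", params={"threads": 5})
    stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    with open("hot_threads-%s.txt" % stamp, "w") as f:  # arbitrary file name
        f.write(resp.text)
    time.sleep(60)  # arbitrary polling interval
```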

Hi, Thanks for your inputs.

Interesting to know about the 75% limit.

ES version is 1.7.2; the JVM is:
openjdk version "1.8.0_60"
OpenJDK Runtime Environment (build 1.8.0_60-b24)
OpenJDK 64-Bit Server VM (build 25.60-b23, mixed mode)

Because of all the rejection errors, I can't find the GC logs easily, but I don't see any GC-related logs even otherwise. Since GCs are happening throughout the day, I should be able to capture some logs.

The logging level is set to INFO, so I should be seeing GC logs, but I am not; not sure why.
es.logger.level: INFO

Old GC times seem to be low.

But it's possible that once the request queue has built up, it may take some time for the CPU to come back to normal; maybe after this 10-15 minute period the cluster is able to keep up with the influx of requests again.

Field data is not an issue I think; it's very low, around 200 MB per node.
I guess the spikes could be related to the filter cache, which is around 3 GB per node. I have set its size to 30%; maybe I should set it back to the default 10%? With 10%, CPU usage will increase because of the higher eviction rate, but there may not be sudden spikes?

I don't see anything wrong with segment merging.

Note that, during this 10-15 minute period, I see that a few requests were successful based on Packetbeat data. So I am guessing a few seconds of old GC could lead to a request build-up and a CPU spike, and it probably takes 10-15 minutes for the system to become stable again?

Are there any improvements we can make? We are looking to optimize the filter caches, as I said in the original post. Any other suggestions?

Thanks.

The filter cache was one issue. We have disabled caching for a few high-cardinality fields. As my understanding goes, there is a bitset for every key/value pair searched, so high-cardinality fields can lead to a lot of bitsets. So we have disabled the cache for these fields.
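Concretely, as I understand the 1.x filter DSL, disabling the cache is just `_cache: false` on the individual filter; a sketch with made-up index and field names:

```python
import json
import requests

ES = "http://localhost:9200"  # assumption: any node in the cluster

# Filtered query where the term filter on a high-cardinality field
# explicitly opts out of the filter cache (1.x filter DSL).
query = {
    "query": {
        "filtered": {
            "query": {"match_all": {}},
            "filter": {
                "term": {
                    "user_id": "u-12345",  # hypothetical high-cardinality field
                    "_cache": False
                }
            }
        }
    }
}

resp = requests.post(ES + "/my_index/_search",  # hypothetical index name
                     data=json.dumps(query),
                     headers={"Content-Type": "application/json"})
print(resp.json()["hits"]["total"])
```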

After these changes we have had 2 more spikes in the last couple of days. These spikes seem a little different: they happened around the time snapshots were being taken, and there were spikes in disk writes and disk IOPS which were not there for the earlier spikes.

First spike


Second instance of spike

There is no increase in the number of requests to the nodes. The increase in open search contexts could be because the cluster is CPU-starved and each request is taking longer.

Disk write IOPS is high, but there doesn't seem to be any segment merging. I can't explain the spike in GC.

These last two spikes happened just after snapshots; we take hourly snapshots to S3.
Does snapshotting lead to any segment merging to make snapshots more efficient?

Disabling the filter cache for the timestamp and other high-cardinality fields seems to have resolved the issue.
I guess the above two spikes were caused by something else.

There haven't been any spikes in the last 2-3 weeks.