CPU and Garbage Collection Time Regularly Spike for Long Periods of Time

Hello,

Elasticsearch 6.7

This is regarding 8 nodes that have these settings:

node.data: true
node.master: false
node.ingest: true
-Xms30g
-Xmx30g
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly

EC2 instances with 8 CPU and 61 GB of RAM (r4.2xlarge)

Recently the CPU usage on these nodes has been jumping from an average of 30% to over 90% for more than an hour at a time. During these episodes the young GC time spikes as well.

I looked at the query cache hit and miss metrics, but they didn't correlate well with this behavior.

The heap usage pattern does correlate, though. It looks like garbage collection is continuously struggling to free up larger chunks of the heap for long periods of time.
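For anyone wanting to put numbers on the GC and heap correlation, here is a minimal sketch that diffs two snapshots of the `GET _nodes/stats/jvm` response and reports, per node, how many collections ran and how much time they took in between. The function name `gc_delta` and the idea of taking the two snapshots some minutes apart are assumptions for illustration; the JSON fields read here (`jvm.gc.collectors.young.collection_count`, `collection_time_in_millis`, `jvm.mem.heap_used_percent`) are the ones the node stats API returns.

```python
def gc_delta(before, after, collector="young"):
    """Compare two parsed `GET _nodes/stats/jvm` responses (dicts).

    Returns, per node, the number of collections and the GC time in
    milliseconds accumulated between the two snapshots, plus the heap
    usage seen in the later snapshot.
    """
    report = {}
    for node_id, node_after in after["nodes"].items():
        node_before = before["nodes"].get(node_id)
        if node_before is None:
            # Node was not present in the first snapshot; skip it.
            continue
        gc_b = node_before["jvm"]["gc"]["collectors"][collector]
        gc_a = node_after["jvm"]["gc"]["collectors"][collector]
        report[node_after.get("name", node_id)] = {
            "collections": gc_a["collection_count"] - gc_b["collection_count"],
            "gc_millis": (gc_a["collection_time_in_millis"]
                          - gc_b["collection_time_in_millis"]),
            "heap_used_percent": node_after["jvm"]["mem"]["heap_used_percent"],
        }
    return report
```

Passing `collector="old"` gives the same figures for the old generation. Note that counters reset when a node restarts, so a negative delta most likely means a restart happened between the snapshots.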

Here is the impact it has on my search and index latency:

This behavior started a few days ago without any change that I know of. I'm having trouble determining the root cause, or even what the next step in the investigation should be.

Curious...
How many shards are on each node?
What is the average shard size?
Has the number of shards been increasing lately?

Are you tracking any statistics on I/O utilization and/or iowait that might correlate? Do you have any charts on merging activity? Is there anything in the logs around the time the increased CPU usage starts?
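If iowait is not being charted yet, it can be derived from two samples of the aggregate `cpu` line in `/proc/stat` on a Linux host. The following is a minimal sketch under that assumption; the function name `iowait_percent` and the sampling interval are illustrative choices, not part of any monitoring tool.

```python
def iowait_percent(cpu_line_before, cpu_line_after):
    """Percentage of CPU time spent in iowait between two samples of
    the aggregate `cpu` line from /proc/stat (Linux).

    Field order after the `cpu` label:
    user nice system idle iowait irq softirq steal guest guest_nice
    """
    def fields(line):
        parts = line.split()
        if not parts or parts[0] != "cpu":
            raise ValueError("expected the aggregate 'cpu' line")
        return [int(value) for value in parts[1:]]

    before = fields(cpu_line_before)
    after = fields(cpu_line_after)
    deltas = [a - b for b, a in zip(before, after)]
    # guest and guest_nice are already counted in user and nice,
    # so only the first eight fields make up the total.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total


def read_cpu_line(path="/proc/stat"):
    """Read the first (aggregate) line of /proc/stat."""
    with open(path) as handle:
        return handle.readline()
```

A typical use would be to call `read_cpu_line()`, sleep for a few seconds, call it again, and pass both lines to `iowait_percent`.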

@stephenb
I have about 850 shards on each of these nodes, and the shards are less than 50GB in size. I use a combination of ILM and daily indices. ILM rolls over at 50GB or after 14 days, and the daily indices that go over 50GB have an index template with enough shards to keep each shard under 50GB.

I just changed the default index template to use 1 shard instead of 4, so the number of shards has actually been decreasing.
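As a rough sanity check on the shard count, the sketch below compares the shards on a node against the commonly cited rule of thumb of keeping shards at or below about 20 per GB of configured heap. That ratio is general guidance rather than a hard limit, and the function name `shard_budget` is a hypothetical helper for illustration.

```python
# Rule-of-thumb ratio; treat it as an assumption, not a hard limit.
SHARDS_PER_GB_HEAP = 20


def shard_budget(heap_gb, shards_on_node, shards_per_gb=SHARDS_PER_GB_HEAP):
    """Return (budget, excess) for a node.

    budget: suggested maximum shard count for the given heap size.
    excess: how many shards the node holds beyond that budget
            (negative when the node is under budget).
    """
    budget = heap_gb * shards_per_gb
    return budget, shards_on_node - budget
```

With the figures from this thread, `shard_budget(30, 850)` gives a budget of 600 shards and an excess of 250, so the shard count per node may be worth reducing further even though it is already trending down.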

@Christian_Dahlqvist
IO wait does correlate with these episodes. I have been averaging an IO wait of around 5 on each of these nodes, but during these episodes the IO wait drops to about 1. I haven't been able to find anything useful in the logs, but I may need to adjust the logger configuration.
