EC2 instances with 8 vCPUs and 61 GB of RAM (r4.2xlarge)
Recently, CPU usage on these nodes has been jumping from an average of around 30% to over 90% for more than an hour straight. During these periods, young GC time spikes as well.
I looked at query cache hit and miss metrics, but they didn't correlate well with this behavior.
The heap used pattern does correlate, though. It looks like garbage collection keeps struggling to free up larger chunks of the heap for long stretches of time.
This behavior started a few days ago without any change that I know of, and I'm having trouble determining the root cause, or even what the next step in the investigation should be.
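For reference, this is roughly how the CPU, heap, and young GC numbers above could be pulled from the nodes stats API — a minimal sketch, assuming the Python `requests` library and a node reachable on `localhost:9200` (adjust the endpoint and interval for your cluster):

```python
# Minimal sketch: poll the Elasticsearch nodes stats API and print CPU, heap,
# and young GC counters so the spikes can be lined up against each other.
# The endpoint and polling interval are assumptions; adjust for your cluster.
import time
import requests

ES_URL = "http://localhost:9200"  # assumed endpoint

def sample_nodes():
    stats = requests.get(f"{ES_URL}/_nodes/stats/os,jvm").json()
    for node in stats["nodes"].values():
        young = node["jvm"]["gc"]["collectors"]["young"]
        print(
            node["name"],
            "cpu%:", node["os"]["cpu"]["percent"],
            "heap%:", node["jvm"]["mem"]["heap_used_percent"],
            "young_gc_count:", young["collection_count"],
            "young_gc_ms:", young["collection_time_in_millis"],
        )

if __name__ == "__main__":
    while True:
        sample_nodes()
        time.sleep(30)  # compare deltas between consecutive samples
```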
Are you tracking any statistics on I/O utilization and/or iowait that might correlate? Do you have any charts on merging activity? Is there anything in the logs around the time the increased CPU usage starts?
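Merge activity and disk I/O counters are also exposed by the nodes stats API. A minimal sketch of reading them, again assuming the Python `requests` library and a node on `localhost:9200` (note that `io_stats` is only reported on Linux):

```python
# Minimal sketch: read merge and filesystem I/O counters from _nodes/stats so
# merge activity and disk I/O can be compared against the CPU spikes.
import requests

ES_URL = "http://localhost:9200"  # assumed endpoint

stats = requests.get(
    f"{ES_URL}/_nodes/stats/indices,fs",
    params={"filter_path": "nodes.*.name,nodes.*.indices.merges,nodes.*.fs.io_stats.total"},
).json()

for node in stats["nodes"].values():
    merges = node["indices"]["merges"]
    io_total = node.get("fs", {}).get("io_stats", {}).get("total", {})  # absent on non-Linux
    print(
        node["name"],
        "merges_current:", merges["current"],
        "merge_time_ms:", merges["total_time_in_millis"],
        "io_ops:", io_total.get("operations"),
    )
```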
@stephenb
I have about 850 shards on each of these nodes, and the shards are all under 50GB in size. I have a combination of ILM-managed and daily indices. ILM rolls over at 50GB or after 14 days, and the daily indices that grow past 50GB have an index template with enough primary shards to keep each shard under 50GB.
I just changed the default index template to use 1 shard instead of 4, so the total number of shards has actually been decreasing.
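For context, the rollover and shard settings described above look roughly like this — a minimal sketch where the policy/template names and index patterns are placeholders, not the actual ones:

```python
# Minimal sketch: an ILM policy that rolls over at 50 GB or 14 days, plus a legacy
# index template pinning new indices to 1 primary shard. Names and patterns are
# placeholders; apply against your own cluster and index naming scheme.
import requests

ES_URL = "http://localhost:9200"  # assumed endpoint

ilm_policy = {
    "policy": {
        "phases": {
            "hot": {
                "actions": {
                    "rollover": {"max_size": "50gb", "max_age": "14d"}
                }
            }
        }
    }
}
resp = requests.put(f"{ES_URL}/_ilm/policy/logs-rollover", json=ilm_policy)
resp.raise_for_status()

index_template = {
    "index_patterns": ["logs-*"],                 # placeholder pattern
    "settings": {"index.number_of_shards": 1},    # was 4 before the change
}
resp = requests.put(f"{ES_URL}/_template/logs-default", json=index_template)
resp.raise_for_status()
```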
@Christian_Dahlqvist
I/O wait does correlate with these episodes. These nodes normally average an I/O wait of around 5, but during these episodes it drops to about 1. I haven't found anything useful in the logs yet, but I may need to play with the logger config.
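If I do end up adjusting the logger config, it would be something along these lines — a minimal sketch via the cluster settings API, where the specific logger packages are guesses rather than settings I've actually applied:

```python
# Minimal sketch: raise logger levels through the cluster settings API while an
# episode is happening, then set them back to null to restore the defaults.
# The logger packages named here are assumptions, not confirmed hot spots.
import requests

ES_URL = "http://localhost:9200"  # assumed endpoint

# bump logging during an episode
requests.put(
    f"{ES_URL}/_cluster/settings",
    json={
        "transient": {
            "logger.org.elasticsearch.index.engine": "DEBUG",
            "logger.org.elasticsearch.index.merge": "DEBUG",
        }
    },
)

# revert when done (null resets a transient setting to its default)
requests.put(
    f"{ES_URL}/_cluster/settings",
    json={
        "transient": {
            "logger.org.elasticsearch.index.engine": None,
            "logger.org.elasticsearch.index.merge": None,
        }
    },
)
```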