EC2 instances with 8 vCPUs and 61 GB of RAM (r4.2xlarge)
Recently, CPU usage on these nodes has been jumping from an average of around 30% to over 90% for more than an hour straight. During these periods, young GC time spikes as well.
I looked at query cache hit and miss metrics, but they didn't correlate well with this behavior.
The heap used pattern does correlate, though. It looks like garbage collection keeps struggling to free up larger chunks of the heap for long stretches of time.
This behavior started a few days ago without any change that I know of, and I'm having trouble determining the root cause, or even what the next step in the investigation should be.
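For reference, this is roughly how the CPU, heap, and young GC numbers above could be pulled from the nodes stats API — a minimal sketch, assuming the Python `requests` library and a node reachable on `localhost:9200` (adjust the endpoint and interval for your cluster):

```python
# Minimal sketch: poll the Elasticsearch nodes stats API and print CPU, heap,
# and young GC counters so the spikes can be lined up against each other.
# The endpoint and polling interval are assumptions; adjust for your cluster.
import time
import requests

ES_URL = "http://localhost:9200"  # assumed endpoint

def sample_nodes():
    stats = requests.get(f"{ES_URL}/_nodes/stats/os,jvm").json()
    for node in stats["nodes"].values():
        young = node["jvm"]["gc"]["collectors"]["young"]
        print(
            node["name"],
            "cpu%:", node["os"]["cpu"]["percent"],
            "heap%:", node["jvm"]["mem"]["heap_used_percent"],
            "young_gc_count:", young["collection_count"],
            "young_gc_ms:", young["collection_time_in_millis"],
        )

if __name__ == "__main__":
    while True:
        sample_nodes()
        time.sleep(30)  # compare deltas between consecutive samples
```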
Are you tracking any statistics on I/O utilization and/or iowait that might correlate? Do you have any charts on merging activity? Is there anything in the logs around the time the increased CPU usage starts?
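Merge activity and disk I/O counters are also exposed by the nodes stats API. A minimal sketch of reading them, again assuming the Python `requests` library and a node on `localhost:9200` (note that `io_stats` is only reported on Linux):

```python
# Minimal sketch: read merge and filesystem I/O counters from _nodes/stats so
# merge activity and disk I/O can be compared against the CPU spikes.
import requests

ES_URL = "http://localhost:9200"  # assumed endpoint

stats = requests.get(
    f"{ES_URL}/_nodes/stats/indices,fs",
    params={"filter_path": "nodes.*.name,nodes.*.indices.merges,nodes.*.fs.io_stats.total"},
).json()

for node in stats["nodes"].values():
    merges = node["indices"]["merges"]
    io_total = node.get("fs", {}).get("io_stats", {}).get("total", {})  # absent on non-Linux
    print(
        node["name"],
        "merges_current:", merges["current"],
        "merge_time_ms:", merges["total_time_in_millis"],
        "io_ops:", io_total.get("operations"),
    )
```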
@stephenb
I have about 850 shards on each of these nodes, and the shards are all under 50GB in size. I have a combination of ILM-managed and daily indices. ILM rolls over at 50GB or after 14 days, and the daily indices that grow past 50GB have an index template with enough primary shards to keep each shard under 50GB.
I just changed the default index template to use 1 shard instead of 4, so the total number of shards has actually been decreasing.
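For context, the rollover and shard settings described above look roughly like this — a minimal sketch where the policy/template names and index patterns are placeholders, not the actual ones:

```python
# Minimal sketch: an ILM policy that rolls over at 50 GB or 14 days, plus a legacy
# index template pinning new indices to 1 primary shard. Names and patterns are
# placeholders; apply against your own cluster and index naming scheme.
import requests

ES_URL = "http://localhost:9200"  # assumed endpoint

ilm_policy = {
    "policy": {
        "phases": {
            "hot": {
                "actions": {
                    "rollover": {"max_size": "50gb", "max_age": "14d"}
                }
            }
        }
    }
}
resp = requests.put(f"{ES_URL}/_ilm/policy/logs-rollover", json=ilm_policy)
resp.raise_for_status()

index_template = {
    "index_patterns": ["logs-*"],                 # placeholder pattern
    "settings": {"index.number_of_shards": 1},    # was 4 before the change
}
resp = requests.put(f"{ES_URL}/_template/logs-default", json=index_template)
resp.raise_for_status()
```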
@Christian_Dahlqvist
I/O wait does correlate with these episodes. These nodes normally average an I/O wait of around 5, but during these episodes it drops to about 1. I haven't found anything useful in the logs yet, but I may need to play with the logger config.
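If I do end up adjusting the logger config, it would be something along these lines — a minimal sketch via the cluster settings API, where the specific logger packages are guesses rather than settings I've actually applied:

```python
# Minimal sketch: raise logger levels through the cluster settings API while an
# episode is happening, then set them back to null to restore the defaults.
# The logger packages named here are assumptions, not confirmed hot spots.
import requests

ES_URL = "http://localhost:9200"  # assumed endpoint

# bump logging during an episode
requests.put(
    f"{ES_URL}/_cluster/settings",
    json={
        "transient": {
            "logger.org.elasticsearch.index.engine": "DEBUG",
            "logger.org.elasticsearch.index.merge": "DEBUG",
        }
    },
)

# revert when done (null resets a transient setting to its default)
requests.put(
    f"{ES_URL}/_cluster/settings",
    json={
        "transient": {
            "logger.org.elasticsearch.index.engine": None,
            "logger.org.elasticsearch.index.merge": None,
        }
    },
)
```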