This is regarding 8 nodes that have these settings:
node.data: true
node.master: false
node.ingest: true
-Xms30g -Xmx30g -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
EC2 instances with 8 vCPUs and 61 GB of RAM (r4.2xlarge)
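For completeness, this is roughly how I double-check the role and heap settings on each node using the nodes info API (a small Python sketch, assuming the cluster is reachable at localhost:9200 with no auth, and that the roles are set explicitly in each node's config so they show up under settings):

```python
import requests

# Nodes info API: ask only for the jvm and settings sections.
resp = requests.get("http://localhost:9200/_nodes/jvm,settings").json()

for node_id, node in resp["nodes"].items():
    # Explicitly-set node.* settings come back nested under "settings".
    node_settings = node["settings"].get("node", {})
    heap_max_gb = node["jvm"]["mem"]["heap_max_in_bytes"] / 1024 ** 3
    print(f'{node["name"]}: data={node_settings.get("data")} '
          f'master={node_settings.get("master")} ingest={node_settings.get("ingest")} '
          f'heap_max={heap_max_gb:.1f} GB')
```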
Recently, CPU usage on these nodes has been shooting from an average of about 30% to over 90% for more than an hour straight. During these periods, young GC time spikes as well.
I looked at the query cache hit and miss metrics, but they didn't correlate well with this behavior.
The heap usage pattern does correlate, though. It looks like garbage collection keeps failing to free up larger chunks of the heap for long stretches of time.
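This is roughly how I'm sampling heap usage and GC time to line them up with the CPU spikes (a Python sketch against the node stats API, assuming localhost:9200, no auth, and a 30-second polling interval; adjust as needed):

```python
import time
import requests

STATS_URL = "http://localhost:9200/_nodes/stats/jvm"

prev_young_ms = {}
while True:
    nodes = requests.get(STATS_URL).json()["nodes"]
    for node_id, node in nodes.items():
        jvm = node["jvm"]
        heap_pct = jvm["mem"]["heap_used_percent"]
        young = jvm["gc"]["collectors"]["young"]
        old = jvm["gc"]["collectors"]["old"]
        # Delta of cumulative young GC time since the last sample makes the
        # spikes easier to see than the raw counter.
        total_young_ms = young["collection_time_in_millis"]
        delta_young_ms = total_young_ms - prev_young_ms.get(node_id, total_young_ms)
        prev_young_ms[node_id] = total_young_ms
        print(f'{node["name"]}: heap_used={heap_pct}% '
              f'young_gc_delta={delta_young_ms}ms '
              f'old_gc_count={old["collection_count"]}')
    time.sleep(30)
```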
Here is the impact it has on my search and index latency:
This behavior just started a few days ago without any change that I know of. I'm having trouble determining what the root cause could be, or even what the next step in the investigation should be.
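One step I'm considering for the next spike is capturing hot threads from the affected nodes while the CPU is pegged, to see what the busy threads are actually doing (sketch below, again assuming localhost:9200 and no auth):

```python
import requests

# Hot threads API returns plain-text stack samples sorted by CPU usage.
hot = requests.get(
    "http://localhost:9200/_nodes/hot_threads",
    params={"threads": 5, "interval": "1s"},
)
print(hot.text)
```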