We have a 4-node (all data nodes) m4.2xlarge cluster on AWS, with around 15 million docs spread across 4 indices.
It's a standard configuration: 5 primaries, 1 replica. One of the indices has 2 replicas.
Normal CPU usage is around 10-15%, but it suddenly jumps to 100% and stays there for 10-15 minutes. It's the same on all nodes. There are no spikes in traffic to ES (based on Packetbeat data). The request queue builds up during this period and a lot of requests get rejected; very few succeed.
With 6 nodes, the spike only reaches around 60-70% CPU usage.
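To see what the CPUs are actually doing while this happens, the hot threads API seems like the most direct tool. A minimal sketch, assuming Python with the requests library and ES reachable on localhost:9200:

```python
import requests

# Capture hot threads across all nodes during a CPU spike.
# threads=10 samples the 10 busiest threads per node over a 1s interval.
resp = requests.get(
    "http://localhost:9200/_nodes/hot_threads",
    params={"threads": 10, "interval": "1s"},
)
print(resp.text)  # plain-text stack traces, one section per node
```

Running this a few times while the CPU is pegged should show whether the time is going into search, merge, or GC-related threads.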
Heap usage seems normal.
Could this be caused by segment merging, perhaps the merge of a large segment?
The largest segment is around 750 MB, which I don't think should be a problem. There is also no spike in disk I/O, and based on the Marvel dashboards there is no correlation between CPU usage and segment merging.
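For completeness, merge activity can also be checked directly from the node stats rather than the Marvel dashboards. A sketch under the same assumptions as above (localhost:9200, Python requests):

```python
import requests

# Per-node merge stats; compare samples taken before and during a spike.
stats = requests.get("http://localhost:9200/_nodes/stats/indices").json()
for node in stats["nodes"].values():
    merges = node["indices"]["merges"]
    print(node["name"],
          "current merges:", merges["current"],
          "total merge time (ms):", merges["total_time_in_millis"])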
Could this be GC? Again, I don't think that's the reason for the CPU spikes. Old GCs are happening frequently, but they occur throughout the day. I hear frequent old GC is bad, but I'm not sure it explains the spikes.
A spike in young GC count could also just be a symptom of something else.
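To rule GC in or out more concretely, the per-node JVM stats expose collection counts and times; diffing two samples taken a minute apart during a spike would show whether GC time actually jumps. Same assumptions as the sketches above:

```python
import requests

# Per-node GC counters; diff consecutive samples during a spike.
stats = requests.get("http://localhost:9200/_nodes/stats/jvm").json()
for node in stats["nodes"].values():
    gc = node["jvm"]["gc"]["collectors"]
    print(node["name"],
          "young:", gc["young"]["collection_count"],
          "(", gc["young"]["collection_time_in_millis"], "ms)",
          "old:", gc["old"]["collection_count"],
          "(", gc["old"]["collection_time_in_millis"], "ms)")
```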
We don't have much fielddata, but we do have high filter cache usage and a high eviction rate. We are working on optimizing the queries and disabling caching for some filters.
The filter cache limit is set to 30% of the heap.
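The filter cache footprint and eviction counters are also available per node, which makes it easy to watch whether evictions accelerate during a spike. A sketch, assuming a 1.x-era cluster where the stat is still called filter_cache:

```python
import requests

# Per-node filter cache footprint and evictions (ES 1.x stat names).
stats = requests.get("http://localhost:9200/_nodes/stats/indices").json()
for node in stats["nodes"].values():
    fc = node["indices"]["filter_cache"]
    print(node["name"],
          "filter cache (bytes):", fc["memory_size_in_bytes"],
          "evictions:", fc["evictions"])
```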
We do have hourly snapshots being taken to S3.
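Since the spikes last 10-15 minutes, it might be worth checking whether they line up with the hourly snapshots. Currently running snapshots can be listed with the snapshot status API; a sketch, same assumptions as above:

```python
import requests

# Lists all currently running snapshots across all repositories.
status = requests.get("http://localhost:9200/_snapshot/_status").json()
for snap in status["snapshots"]:
    print(snap["snapshot"], snap["repository"], snap["state"])
```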
At this point we are clueless about how to debug this. Is it possible that this has nothing to do with ES at all? We don't have anything else running on these systems other than Packetbeat and the CloudWatch monitoring scripts. We also ship Marvel stats to a separate ES cluster.