We have a 4-node (all data nodes) m4.2xlarge cluster on AWS, with around 15 million docs spread across 4 indices.
It's a standard configuration: 5 primary shards, 1 replica. One of the indices has 2 replicas.
Normal CPU usage is around 10-15%, but it suddenly jumps to 100% and stays there for 10-15 minutes. It's the same on all nodes. There are no spikes in traffic to ES (based on packetbeat data). The request queue builds up during this period, a lot of requests get rejected, and very few succeed.
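For anyone hitting something similar: the rejections can be confirmed from the cluster itself via the `_cat/thread_pool` API (available in ES 1.x). A sketch, assuming the default HTTP port; the column list in `h=` is just an illustration:

```shell
# Show active/queue/rejected counts for the search and index thread pools per node.
curl -s "localhost:9200/_cat/thread_pool?v&h=host,search.active,search.queue,search.rejected,index.active,index.queue,index.rejected"
```

The rejected counters are cumulative since node start, so sample before and after a spike to see how many rejections the spike itself caused.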
With 6 nodes, the spike is only around 60-70% CPU usage.
Heap usage seems normal.
Could this be because of segment merging, e.g. a merge of a large segment?
The largest segment is around 750 MB, which I don't think should be a problem. There is also no spike in disk I/O, and no correlation between CPU usage and segment merging based on the Marvel dashboards.
Could this be because of GC? Again, I don't think this is the reason for the CPU spikes. Old GCs are happening frequently, but they occur throughout the day. I hear old GC is bad, but I am not sure it explains the spikes.
A spike in young GC count could be a symptom of something else.
We don't have much field data, but we do have high filter cache usage and a high eviction rate. We are working on optimizing the queries and disabling caching for some fields.
The filter cache limit is set to 30% of the heap.
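For reference, that 30% limit corresponds to this static node-level setting in `elasticsearch.yml` (ES 1.x setting name; shown as a sketch of our config, not a recommendation):

```yaml
# ES 1.x node-level filter cache cap; the default is 10% of the heap.
indices.cache.filter.size: 30%
```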
We do have hourly snapshots being taken to S3.
At this point we are clueless about how to debug this. Is it possible that this has nothing to do with ES at all? We don't have anything else running on these systems other than packetbeat and the CloudWatch monitoring scripts. We also send Marvel stats to a different ES cluster.
My first guess is old GC, based on the graphs and the fact that requests are being rejected. I don't think it is segment merging per se (see caveat note below), since that tends to be "friendlier" and has throttling in place.
Your heaps are running right at 75%, which is the threshold Elasticsearch configures for old-gen GCs to start running. That is, the JVM will avoid collecting the old generation until heap usage reaches 75%, then it'll kick off an old-gen GC to clean the heap.
If the heap is full of actively used objects, old-gen GCs can be very expensive in terms of CPU, because the collector has to start shuffling objects around the heap, compacting regions, etc. It also takes a long time. My guess is that you're seeing frequent, smaller old-gen GCs because the JVM is only requesting small amounts of memory... but every once in a while it requests a larger block, the GC struggles to provide the memory, and you get a huge CPU spike.
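For context, the 75% figure comes from the CMS flags the stock ES 1.x startup script passes to the JVM. Roughly (an excerpt-style sketch, assuming the default `bin/elasticsearch.in.sh`; exact layout may differ between releases):

```shell
# CMS only starts an old-gen collection cycle once the old generation is ~75% full,
# and UseCMSInitiatingOccupancyOnly stops the JVM from adaptively changing that threshold.
JAVA_OPTS="$JAVA_OPTS -XX:+UseConcMarkSweepGC"
JAVA_OPTS="$JAVA_OPTS -XX:CMSInitiatingOccupancyFraction=75"
JAVA_OPTS="$JAVA_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
```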
What version of ES are you using?
Can you paste some of the old-gen GCs from your logs? They should show up as warnings, listing a duration and how much memory was reclaimed.
What's your field-data usage look like?
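One quick way to answer that, assuming access to the cluster: the `_cat/fielddata` API (available since ES 1.0) reports fielddata memory per node, and can break it down per field. A sketch; the `fields=*` parameter is optional:

```shell
# Per-node fielddata memory usage, broken down by field.
curl -s "localhost:9200/_cat/fielddata?v&fields=*"
```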
Are you using nested docs, parent/child?
Do you happen to have a large number of fields and/or types in a single index? There is a known problem with how Lucene 4.x (used in ES 1.x) merges segments: when merging, it loads the norms for all fields in the segment into memory for the merge. If you have many, many fields this can eat up a lot of transient memory during a merge, and could trigger some heavy GCs.
Could you show the segment merging chart? Would be interesting to see if that lines up with the CPU usage (related to previous point)
If possible, you could try to collect a Hot Threads during one of the spikes, to get a sense of what threads are burning cpu cycles.
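A sketch of capturing hot threads with the `_nodes/hot_threads` API (the `threads` and `interval` parameters exist in ES 1.x; the values here are illustrative):

```shell
# Snapshot the busiest threads on every node; run this several times during a spike,
# saving each sample to a timestamped file for comparison afterwards.
curl -s "localhost:9200/_nodes/hot_threads?threads=10&interval=500ms" > hot_threads_$(date +%s).txt
```

A single sample can be misleading, so taking a handful over the 10-15 minute window gives a much better picture of what's burning CPU.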
Hi, Thanks for your inputs.
Interesting to know about 75% limit.
ES version is 1.7.2:
openjdk version "1.8.0_60"
OpenJDK Runtime Environment (build 1.8.0_60-b24)
OpenJDK 64-Bit Server VM (build 25.60-b23, mixed mode)
Because of all the rejection errors, I can't find the GC logs easily, but I don't see any GC-related logs even otherwise. Since GCs are happening throughout the day, I should be able to capture some logs.
The logging level is set to INFO, so I should be seeing GC logs, but I am not. Not sure why.
Old GC times seem to be low.
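One way to sanity-check that, assuming access to the cluster: the node stats API reports cumulative GC counts and times per collector, so diffing two samples taken before and after a spike shows how much time actually went to old-gen GC. A sketch (the `grep` is just a crude way to pull out the old-collector block):

```shell
# Cumulative young/old GC counts and times per node since node start.
curl -s "localhost:9200/_nodes/stats/jvm?pretty" | grep -A3 '"old"'
```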
But it's possible that once the request queue has built up, it takes some time for the CPU to come back to normal; maybe after this 10-15 minute period the cluster is able to keep up with the influx of requests again.
I don't think field data is an issue; it's very low, around 200 MB per node.
I guess the spikes could be related to the filter cache, which is around 3 GB per node. I have set its size to 30%; maybe I should set it back to the default 10%? With 10%, CPU usage will increase because of a higher eviction rate, but there may not be sudden spikes?
I don't see anything wrong with segment merging.
Note that during this 10-15 minute period, I see a few requests were successful based on packetbeat data. So I am guessing a few seconds of old GC could lead to a request build-up and a CPU spike, and it probably takes 10-15 minutes for the system to become stable again?
Are there any improvements we can make? We are looking to optimize the filter caches, as I said in the original post. Any other suggestions?
Disabling filter cache for timestamp and high cardinality fields seems to have resolved the issue.
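For anyone landing here later: in ES 1.x query DSL, individual filters can opt out of the filter cache with `_cache: false`. A hedged sketch of what we did, using an illustrative field name and range:

```json
{
  "query": {
    "filtered": {
      "filter": {
        "range": {
          "timestamp": { "gte": "now-1h" },
          "_cache": false
        }
      }
    }
  }
}
```

The idea is that caching filters on a timestamp or other high-cardinality field produces entries that are almost never reused, so they just churn the cache and drive up the eviction rate.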
I guess the above two spikes are because of something else.
There haven't been any spikes in last 2-3 weeks.