How can we investigate heap usage spikes?

Unfortunately, the query profiling tools (the Profile API, the slowlog, and the Tasks API) don't expose any memory information, just runtime/latency-oriented metrics. The slowlog can be helpful if you know what to look for; e.g. some aggregation combinations are inherently memory-hungry, and you develop a feel for them after looking at enough slow queries. But there isn't a good tool for this at the moment.
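
If you do want to lean on the slowlog, its thresholds are dynamic index settings, so you can tighten them on a live index. A minimal sketch, assuming a hypothetical index called `my-index` (the threshold values are just illustrative):

```
curl -XPUT 'localhost:9200/my-index/_settings' -H 'Content-Type: application/json' -d'
{
  "index.search.slowlog.threshold.query.warn": "5s",
  "index.search.slowlog.threshold.query.info": "2s"
}'
```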

A heap dump will give you a good idea of what's going on, but it's non-trivial to collect. If you can manage it, that is by far the best way to diagnose what happens during these heap-pressure events: it should be obvious whether the memory is being consumed by an aggregation, bulk indexing, or something else (a suggester FST, etc).
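
For reference, a heap dump of a running node can be captured with the JDK's jmap tool (the PID and output path below are placeholders) and then opened in an analyzer such as Eclipse MAT:

```
# Find the Elasticsearch process ID (run as the same user that runs Elasticsearch)
jps -l | grep -i elasticsearch

# Capture a heap dump; -dump:live triggers a GC first so only live objects are kept
jmap -dump:live,format=b,file=/tmp/es-heap.hprof <es-pid>
```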

For aggregations, memory use basically boils down to the number of buckets involved in the search, so deeply nested or very large aggs can eat a lot of memory. Some aggs (cardinality, percentiles) are much heavier than simple metrics like avg. Many concurrent aggregations can also cause problems: each individual agg may stay under the breaker limit, but together they can allocate a large amount of heap.
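
To make the bucket math concrete, here's a hypothetical request (the index and field names are made up): a terms agg of size 1000 nested under a daily date_histogram over a year creates on the order of 365 × 1000 buckets, each carrying a percentiles sketch:

```
curl -XPOST 'localhost:9200/my-index/_search?size=0' -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "per_day": {
      "date_histogram": { "field": "@timestamp", "interval": "day" },
      "aggs": {
        "top_users": {
          "terms": { "field": "user_id", "size": 1000 },
          "aggs": {
            "latency_pcts": { "percentiles": { "field": "latency_ms" } }
          }
        }
      }
    }
  }
}'
```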

Does anyone remember bugs fixed after 5.5.2 that could be related to this behavior?

Not off the top of my head... What kind of aggregations are you running? That might jog my memory, or help diagnose the existing aggs.

Could the new ES 7 real-memory circuit breaker help with such issues by preventing execution of heap-heavy queries?

I think the real-memory circuit breaker would help a lot, yes. It's not bulletproof in all regards, but it does a fantastic job of catching situations the normal breakers miss (e.g. many concurrent aggs that are individually small enough to pass the request breaker).
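
For reference, the relevant 7.x settings look something like this. They are static, so they live in elasticsearch.yml and require a node restart; the real-memory breaker is enabled by default on 7.x, so this just makes the defaults explicit:

```
# elasticsearch.yml
indices.breaker.total.use_real_memory: true
indices.breaker.total.limit: 95%
```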

Relatedly, search.max_buckets was introduced in 6.x and puts a soft limit on the number of buckets an agg can return (the agg is aborted if the limit is exceeded). This is A) a good way to rein in queries that might be out of your control, and B) useful for catching rogue/bad aggs: set it to a reasonable level and see if anything trips it and throws an exception.
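
It's a dynamic cluster setting, so something like the following can be applied without a restart (10000 is just an illustrative value):

```
curl -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d'
{
  "transient": {
    "search.max_buckets": 10000
  }
}'
```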

Can the 5.5.2 circuit breakers (or some other setting) help us prevent execution of heap-heavy queries?

You could try setting the request breaker more conservatively, in an attempt to catch aggs that are borderline. But it's not nearly as robust as the real-memory breaker and may not help.
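
If you want to experiment, indices.breaker.request.limit is a dynamic cluster setting (the default is 60% of heap); a sketch with an illustrative, more conservative value:

```
curl -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "indices.breaker.request.limit": "40%"
  }
}'
```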

Another avenue: large bulk requests can consume a lot of temporary heap space while they are processed and redirected to shards. It's definitely possible to chew through many GB of heap with transient bulk requests, which adds a lot of unexpected heap pressure.
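
If bulk traffic is a suspect, the usual mitigations are sending smaller bulk batches from the client and, as a backstop, capping the HTTP request body size with the static http.max_content_length setting (the 50mb below is only an example; the default is 100mb):

```
# elasticsearch.yml
http.max_content_length: 50mb
```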