Unfortunately, the query profiling tools (the Profile API, slowlog, and tasks) don't expose any memory information, just runtime/latency-oriented metrics. Slowlog can be helpful if you know what to look for, e.g. some agg combinations are going to eat memory due to their nature, and you develop a feel for that after looking at enough of them. But there isn't a good tool for that right now.
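If you do go the slowlog route, something like this turns it on per index (the index name and thresholds here are illustrative; tune them to your workload):

```bash
# Dynamic index settings; no restart needed. Slow queries land in the
# *_index_search_slowlog log file at the matching level.
curl -XPUT 'localhost:9200/my-index/_settings' -H 'Content-Type: application/json' -d'
{
  "index.search.slowlog.threshold.query.warn": "10s",
  "index.search.slowlog.threshold.query.info": "5s",
  "index.search.slowlog.threshold.fetch.warn": "1s"
}'
```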
A heap dump will give you a good idea of what's going on, but it's pretty non-trivial to collect. If possible, that would 100% be the best way to diagnose what's happening during these heap pressure events. It should be pretty obvious whether the memory is being consumed by an aggregation, bulk indexing, or something else (suggester FST, etc).
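If you can tolerate the pause, a rough sketch with standard JDK tooling (run it as the same user Elasticsearch runs as; the `pgrep` pattern is an assumption about your install):

```bash
# Find the Elasticsearch PID, then dump the heap. "live" triggers a full GC
# first so the dump contains only reachable objects; expect a pause and a
# file roughly the size of the used heap.
ES_PID=$(pgrep -f org.elasticsearch.bootstrap.Elasticsearch)
jmap -dump:live,format=b,file=/tmp/es-heap.hprof "$ES_PID"
```

You can then open the .hprof file in a tool like Eclipse MAT to see which objects dominate the heap.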
For aggregations, it basically boils down to the number of buckets involved in the search. So deeply nested or very large aggs can eat a lot of memory. Some aggs (`cardinality`, `percentiles`) are much heavier than simple metrics like `avg`. Many concurrent aggregations can also cause issues: each individual agg may be smaller than the breaker limit, but all together they can allocate a large amount of heap.
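To make the bucket math concrete, here's an illustrative request shape (index and field names are made up) that can blow up: nested `terms` aggs multiply, so 10,000 × 10,000 is up to 100 million buckets.

```bash
# Each outer bucket spawns its own inner terms agg, so bucket counts
# multiply. This is the kind of shape that quietly eats heap.
curl -XPOST 'localhost:9200/my-index/_search' -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "aggs": {
    "by_user": {
      "terms": { "field": "user_id", "size": 10000 },
      "aggs": {
        "by_url": {
          "terms": { "field": "url", "size": 10000 }
        }
      }
    }
  }
}'
```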
Maybe someone remembers bugs fixed after 5.5.2 that could be related to such behavior?
Not off the top of my head... what kind of aggregations are you running? It might jog my memory, or help diagnose the existing aggs.
Could the new ES 7 real-memory circuit breaker help with such issues by preventing execution of queries with heavy heap usage?
I think the real-memory circuit breaker would help a lot, yes. It's not bulletproof in all regards, but it does a fantastic job catching situations the normal breakers miss (e.g. many concurrent aggs that are individually small enough to pass the request breaker).
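Either way, the node stats API is a cheap way to check whether any breakers are firing at all:

```bash
# Shows each breaker's configured limit, current estimated size, and how
# many times it has tripped, per node.
curl -XGET 'localhost:9200/_nodes/stats/breaker?pretty'
```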
Related: `search.max_buckets` was introduced sometime in 6.x and puts a soft limit on the number of buckets an agg can return (it aborts the agg if the limit is surpassed). This is A) a good way to limit queries that might be out of your control, and B) can be used to catch rogue/bad aggs, e.g. set it to a reasonable level and see if anything trips and throws an exception.
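If/when you're on 6.x, setting it is a one-liner (10,000 here is just an illustrative starting point):

```bash
# Dynamic cluster setting; aggs that would produce more buckets than this
# are aborted with an exception instead of running to completion.
curl -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d'
{
  "transient": { "search.max_buckets": 10000 }
}'
```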
Can the 5.5.2 circuit breakers (or some other setting) help us prevent execution of queries with heavy heap usage?
You could try setting the `request` breaker more conservatively, in an attempt to catch aggs that are borderline. But it's not nearly as robust as the real-memory breaker and may not help.
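For example (40% is illustrative; the 5.x default is 60% of heap):

```bash
# Dynamic cluster setting; lowers the per-request breaker so borderline
# aggs trip the breaker instead of exhausting the heap.
curl -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d'
{
  "transient": { "indices.breaker.request.limit": "40%" }
}'
```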
Another avenue: large bulk requests can eat up a lot of temporary heap space as they are processed and redirected to shards. It's definitely possible to consume many GB of heap with transient bulk requests, which can add a lot of unexpected heap pressure.
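If you suspect bulks, one mitigation sketch is simply sending smaller requests. This assumes an NDJSON file where every action line is followed by a source line (not true for deletes), so an even chunk size keeps action/source pairs aligned:

```bash
# Split a big bulk body into 10,000-line (5,000-doc) chunks so each request
# holds less transient heap on the coordinating node.
split -l 10000 bulk.ndjson chunk_
for f in chunk_*; do
  curl -XPOST 'localhost:9200/_bulk' \
    -H 'Content-Type: application/x-ndjson' --data-binary "@$f"
done
```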