How to investigate heap usage spikes?

We're running ES 5.5.2 and seeing sporadic long GC pauses. From the GC logs, the two main causes are allocation failures and promotion failures, which correlate with huge spikes in heap usage on the affected nodes. Something allocates a lot of heap (usage jumps from 16G to 25G), which forces a full "stop-the-world" GC that sometimes lasts several seconds.

We've put NGINX in front of the ES nodes and are logging requests, but we see no anomalies (for example, in response size, which could indicate queries returning a lot of data).

We think there could be two main reasons:

  1. Some nasty query that doesn't return much data but forces ES to use a lot of heap to execute it.
  2. Some bug in ES (like the one with aggregations that was fixed in 5.2).

My questions are:

  1. Is there any way to get the memory footprint of each query (for example, in response headers, so we could log them and track them down)? Something like the reads/writes per query in SQL profilers?
  2. Is there some sort of slowlog based on a query's heap usage rather than its duration?
  3. Does anyone remember bugs fixed after 5.5.2 that could be related to this behavior?
  4. Could the new real-memory circuit breaker in ES 7 help with such issues by preventing execution of queries that use a lot of heap?
  5. Could the 5.5.2 circuit breakers (or some other setting) help us prevent execution of such queries?

Unfortunately, the query profiling tools (the Profile API, slowlog, and tasks) don't really have any memory information, just runtime/latency-oriented metrics. Slowlog can be helpful if you know what to look for, e.g. some agg combinations eat memory by their nature, and you develop a feel for that after looking at enough of them. But there isn't a good tool for this right now.

A heap dump will give you a good idea of what's going on, but it's pretty non-trivial to collect. If possible, that would 100% be the best way to diagnose what's happening during these heap pressure events. It should be pretty obvious whether the memory is being consumed by an aggregation, bulk indexing, or something else (suggester FST, etc.).
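One way to make the dump easier to collect is to watch heap usage and only dump when a node is actually under pressure. The sketch below is just that, a sketch: it assumes you already fetch `GET _nodes/stats/jvm` yourself, and the threshold, PID and file path are hypothetical. `jmap` ships with the JDK, has to run as the same user as the ES process, and pauses the JVM while it writes the dump.

```python
def nodes_over_threshold(node_stats, threshold_percent):
    """Return (node name, heap_used_percent) pairs at or above the threshold.

    node_stats is the parsed JSON body of GET _nodes/stats/jvm.
    """
    hot = []
    for node_id, node in node_stats.get("nodes", {}).items():
        used = node["jvm"]["mem"]["heap_used_percent"]
        if used >= threshold_percent:
            hot.append((node.get("name", node_id), used))
    # Most loaded node first.
    return sorted(hot, key=lambda item: item[1], reverse=True)


def heap_dump_command(pid, path, live_only=False):
    """Build a jmap command line that writes a binary heap dump to path.

    live_only=True adds the "live" option, which makes the JVM run a full GC
    first and dump reachable objects only.
    """
    options = "format=b,file=" + path
    if live_only:
        options = "live," + options
    return ["jmap", "-dump:" + options, str(pid)]


# Hypothetical stats for two nodes, trimmed to the fields used above.
sample_stats = {
    "nodes": {
        "a1": {"name": "es-data-1", "jvm": {"mem": {"heap_used_percent": 91}}},
        "b2": {"name": "es-data-2", "jvm": {"mem": {"heap_used_percent": 55}}},
    }
}
hot_nodes = nodes_over_threshold(sample_stats, 85)
dump_cmd = heap_dump_command(1234, "/tmp/es-heap.hprof")
```

The command list can be handed to `subprocess.run` on the affected node. The resulting .hprof file can then be opened in a heap analyzer to see what is holding the memory.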

For aggregations, it basically boils down to the number of buckets involved in the search, so deeply nested or very large aggs can eat a lot of memory. Some aggs (cardinality, percentiles) are much heavier than simple metrics like avg. Many concurrent aggregations can also cause issues: each individual agg may be smaller than the breaker limit, but together they can allocate a large amount of heap.
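To get a feel for how quickly nesting multiplies buckets, here is a rough worst-case estimate. It assumes every parent bucket gets a full set of child buckets, which real data rarely does, and the sizes used are made up for illustration.

```python
def estimate_buckets(sizes_per_level):
    """Worst-case bucket count for nested bucket aggregations.

    sizes_per_level lists the maximum number of buckets each nesting level
    can produce per parent bucket, outermost level first.
    """
    total = 0
    per_level = 1
    for size in sizes_per_level:
        per_level *= size   # buckets that exist at this depth
        total += per_level  # buckets at all depths so far
    return total


# Hypothetical agg: terms (size 1000) > date_histogram (365 days) > terms (size 50)
worst_case = estimate_buckets([1000, 365, 50])
print(worst_case)  # 18616000
```

Three innocent-looking levels already give over 18 million buckets in the worst case, and that is for a single request.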

Does anyone remember bugs fixed after 5.5.2 that could be related to this behavior?

Not off the top of my head... what kind of aggregations are you running? It might spark a memory, or help diagnose the existing aggs.

Could the new real-memory circuit breaker in ES 7 help with such issues by preventing execution of queries that use a lot of heap?

I think the real-memory circuit breaker would help a lot, yes. It's not bulletproof in all regards, but it does a fantastic job of catching situations the normal breaker misses (e.g. many concurrent aggs that are individually small enough to pass the request breaker).

Relatedly, search.max_buckets was introduced sometime in 6.x and puts a soft limit on the number of buckets an agg can return (it aborts the agg if the limit is surpassed). This is A) a good way to limit queries that might be out of your control, and B) a way to catch rogue/bad aggs, e.g. set it to a reasonable level and see if anything trips it and throws an exception.
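If you do end up on a version that has it, search.max_buckets can be changed through the cluster settings API. A minimal sketch of building that request follows. The URL and the limit of 10000 are assumptions for illustration, and nothing is sent until you pass the request to `urllib.request.urlopen`.

```python
import json
import urllib.request


def max_buckets_settings(limit, persistent=False):
    """Body for PUT /_cluster/settings that sets search.max_buckets."""
    scope = "persistent" if persistent else "transient"
    return {scope: {"search.max_buckets": limit}}


def build_settings_request(base_url, body):
    """Prepare (but do not send) the cluster settings update request."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/_cluster/settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Hypothetical cluster address and limit.
body = max_buckets_settings(10000)
request = build_settings_request("http://localhost:9200/", body)
```

A transient setting is lost on a full cluster restart, which is convenient while you are only experimenting to see which aggs trip the limit.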

Could the 5.5.2 circuit breakers (or some other setting) help us prevent execution of such queries?

You could try setting the request breaker more conservatively, in an attempt to catch aggs that are borderline. But it's not nearly as robust as the real-memory breaker and may not help.
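For reference, the request breaker is controlled by indices.breaker.request.limit, which as far as I know defaults to 60% of the JVM heap and can be updated via PUT /_cluster/settings. A small sketch of the settings body and of what a given percentage means in bytes is below. The 40% value and the 30 GB heap are purely hypothetical numbers, not a recommendation.

```python
def request_breaker_settings(limit_percent, persistent=False):
    """Body for PUT /_cluster/settings that tightens the request breaker."""
    if not 0 < limit_percent <= 100:
        raise ValueError("limit_percent must be in (0, 100]")
    scope = "persistent" if persistent else "transient"
    return {scope: {"indices.breaker.request.limit": f"{limit_percent}%"}}


def limit_in_bytes(heap_bytes, limit_percent):
    """Heap a single request may reserve before the breaker trips."""
    return heap_bytes * limit_percent // 100


GIB = 1024 ** 3
breaker_body = request_breaker_settings(40)        # hypothetical limit
breaker_bytes = limit_in_bytes(30 * GIB, 40)       # hypothetical 30 GB heap
```

Keep in mind the limit applies per request, so it does nothing about many mid-sized requests running at the same time.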

Another avenue: large bulk requests can eat up a lot of temporary heap space as they are processed and redirected to shards. It's definitely possible to use many GB of heap with transient bulk requests, which can add a lot of unexpected heap pressure.
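If bulk size turns out to be the culprit, one client-side mitigation is to cap the size of each bulk body. This is a rough sketch, assuming the indexing client is under your control. The index name, type and byte budget are made up, and a single action larger than the budget still goes out on its own.

```python
import json


def chunk_bulk_actions(actions, max_bytes):
    """Yield NDJSON bulk bodies of at most max_bytes each.

    actions is an iterable of (action_metadata, document) pairs. A pair that
    is larger than max_bytes by itself is emitted as its own body.
    """
    batch, batch_bytes = [], 0
    for meta, doc in actions:
        piece = json.dumps(meta) + "\n" + json.dumps(doc) + "\n"
        size = len(piece.encode("utf-8"))
        # Flush the current batch if adding this pair would exceed the budget.
        if batch and batch_bytes + size > max_bytes:
            yield "".join(batch)
            batch, batch_bytes = [], 0
        batch.append(piece)
        batch_bytes += size
    if batch:
        yield "".join(batch)


# Hypothetical documents; each pair serializes to the same number of bytes.
meta = {"index": {"_index": "logs", "_type": "doc"}}
actions = [(meta, {"n": i}) for i in range(5)]
pair_bytes = len((json.dumps(meta) + "\n" + json.dumps({"n": 0}) + "\n").encode("utf-8"))
bodies = list(chunk_bulk_actions(actions, max_bytes=2 * pair_bytes))
```

Each yielded string can be POSTed to the _bulk endpoint as-is. Smaller bodies mean more round trips, so the budget is a trade-off between throughput and transient heap on the coordinating node.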

Thanks for the answer!
I guess trying to move to ES7 is our best bet.
Btw, would the ES7 circuit breaker (or something else) "catch" large transient bulk requests too?

PS: I really wish ES could provide query execution performance metrics in response headers, or some sort of tool like an RDBMS real-time profiler.