We're running ES 5.5.2 and are seeing sporadic long GC pauses. After collecting and analyzing the GC logs, we see two main causes of the long GCs: allocation failures and promotion failures. These correlate with huge spikes in heap usage on the affected nodes. So something acquires a lot of heap (from 16G to 25G), which forces a full "stop-the-world" GC that sometimes takes up to several seconds.
We've put NGINX in front of the ES nodes and are logging requests, but we see no anomalies (for example, in response size, which could indicate queries returning a lot of data).
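A minimal sketch of how GC pauses could be lined up with the NGINX log, assuming both have already been parsed into timestamps. The function name `requests_before_pauses`, the 30-second window, and the sample data are all made up for illustration; the actual log parsing is omitted.

```python
from datetime import datetime, timedelta


def requests_before_pauses(pause_times, requests, window_seconds=30):
    """Map each GC pause start time to the requests logged shortly before it.

    pause_times: list of datetime values (e.g. taken from the GC logs)
    requests: list of (datetime, request_line) tuples (e.g. from NGINX logs)
    window_seconds: how far back before each pause to look (an assumption)
    """
    window = timedelta(seconds=window_seconds)
    result = {}
    for pause in pause_times:
        # Keep only requests that arrived within the window before the pause.
        result[pause] = [
            line for ts, line in requests
            if pause - window <= ts <= pause
        ]
    return result


# Made-up sample data, only to show the shape of the input.
pauses = [datetime(2019, 1, 1, 12, 0, 30)]
requests = [
    (datetime(2019, 1, 1, 11, 59, 0), "POST /index-a/_search"),
    (datetime(2019, 1, 1, 12, 0, 10), "POST /index-b/_search"),
    (datetime(2019, 1, 1, 12, 0, 40), "GET /_cat/health"),
]

suspects = requests_before_pauses(pauses, requests)
for pause, lines in suspects.items():
    print(pause.isoformat(), lines)
```

This only narrows down candidates by time; it says nothing about how much heap each request actually used, which is the information we are missing.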
We think there could be two main reasons:
- Some nasty query that doesn't return a lot of data but forces ES to use a lot of heap to execute it.
- Some bug in ES (like the one that was fixed in 5.2 with aggregations).
My questions are:
- Is there any way to get information about the memory footprint of each query (for example, in response headers, so that we could log them and track them down)? Something like the reads/writes per query in SQL profilers?
- Is there a way to get some sort of slowlog, but based on the heap memory usage of the query?
- Does anyone remember bugs fixed after 5.5.2 that could be related to such behavior?
- Could the new real memory circuit breaker in ES 7 help with such issues by preventing execution of queries with heavy heap usage?
- Can the 5.5.2 circuit breakers (or some other setting) help us prevent execution of queries with heavy heap usage?