Track Queries with High Heap Usage

Hi All,

I was wondering what the best way is to track queries based on their heap usage.

Context: I've recently had to increase the heap allocation on the coordinating nodes (3 nodes in the cluster) from 10GB RAM / 8GB heap to 20GB RAM / 18GB heap, because they were frequently hitting circuit breakers. The cluster has become more actively used, so while I would have expected to increase the heap a little, I wouldn't have expected to increase it this much.

I suspect that a few newer queries are using far more heap than I would expect, but I haven't found a reliable way to track queries by heap usage.

I looked at the slow log, but I don't think that's the right approach, as the queries being executed aren't slow, they just (potentially) use a lot of heap.
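For reference, the closest I've found so far is polling the per-node circuit breaker stats, which at least shows which breaker is tripping and on which node. A rough sketch in Python (the endpoint, credentials, and TLS handling below are placeholders for illustration, not my actual setup):

```python
# Rough sketch: dump per-node circuit breaker stats to see which breaker is
# tripping and on which node. URL, credentials and TLS handling are placeholders.
import requests

ES_URL = "https://localhost:9200"   # placeholder: your cluster endpoint
AUTH = ("elastic", "changeme")      # placeholder: your credentials

resp = requests.get(f"{ES_URL}/_nodes/stats/breaker", auth=AUTH, verify=False)
resp.raise_for_status()

for node in resp.json()["nodes"].values():
    for breaker, stats in node["breakers"].items():
        print(
            f"{node['name']:<20} {breaker:<25} "
            f"estimated={stats['estimated_size']:>10} "
            f"limit={stats['limit_size']:>10} "
            f"tripped={stats['tripped']}"
        )
```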

The queries are all from Kibana features (Observability rules, SIEM rules), so I don't have much control over, or ability to debug, the actual queries being executed.

Setup:
Elasticsearch Version: 7.17.2
Kibana Version: 7.17.2
Install Method: Kubernetes/Containers/ECK

Usually, queries with lots of aggregations and sorts tend to cause higher CPU and memory usage.

What kind of queries do you have in your application?

Have you seen out-of-memory issues on any node? One way to find such queries would be to analyse the heap dumps generated in OOM scenarios.
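If you want to see what is actually running, the tasks API with `detailed=true` includes the query source in each search task's description. Something roughly like this (connection details are placeholders):

```python
# Rough sketch: list in-flight search tasks with their query source.
# URL and credentials are placeholders.
import requests

ES_URL = "https://localhost:9200"   # placeholder
AUTH = ("elastic", "changeme")      # placeholder

resp = requests.get(
    f"{ES_URL}/_tasks",
    params={"actions": "*search*", "detailed": "true", "group_by": "parents"},
    auth=AUTH,
    verify=False,
)
resp.raise_for_status()

for task_id, task in resp.json().get("tasks", {}).items():
    runtime_ms = task["running_time_in_nanos"] // 1_000_000
    # 'description' contains the target indices and the query body
    print(f"{task_id}  {runtime_ms} ms")
    print(" ", task.get("description", "")[:500])
```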

Hi @DineshNaik, the queries all come from Kibana "Rules". The main rules we use are the Metrics rules under Observability and a few Log Threshold rules, also under Observability. We also have SIEM rules (mainly the prebuilt ones), but those haven't really changed since the noticeable uptick in heap usage, so it leads me to believe that some of the Metrics or Log Threshold rules are causing the issue.

Regarding OOM kills: so far I haven't noticed any, but I have seen the coordinating nodes lock up for several minutes, close to the point of being OOM killed, as the heap takes a while to free up.
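One rough way to watch this is polling the node JVM stats and keeping an eye on heap_used_percent on the coordinating nodes, for example (connection details are placeholders):

```python
# Rough sketch: poll heap_used_percent per node to see how long heap stays
# pinned after a burst of rule executions. URL and credentials are placeholders.
import time
import requests

ES_URL = "https://localhost:9200"   # placeholder
AUTH = ("elastic", "changeme")      # placeholder

while True:
    resp = requests.get(f"{ES_URL}/_nodes/stats/jvm", auth=AUTH, verify=False)
    resp.raise_for_status()
    for node in resp.json()["nodes"].values():
        heap = node["jvm"]["mem"]
        print(f"{node['name']:<20} heap_used={heap['heap_used_percent']}%")
    time.sleep(30)
```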

What about date ranges? Has the data grown drastically, and are you querying all of it?

Regarding date ranges, most of the queries only look back over the last ~5 minutes.

The data has grown a bit, but not what I'd consider drastic. (Context: before the issue the cluster handled ~50k events/s; it now handles ~65k events/s, so I wouldn't expect the heap requirement to double.)

What are your compute configurations, infrastructure-wise?
In the slow logs, have you checked whether some queries are taking more time than usual?
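Even if the queries aren't slow, you could temporarily lower the search slow log thresholds on the indices those rules query, so that every query gets logged and its body can be inspected. Roughly (index pattern and connection details are placeholders):

```python
# Rough sketch: lower search slow log thresholds so every query is logged.
# Index pattern, URL and credentials are placeholders.
import requests

ES_URL = "https://localhost:9200"   # placeholder
AUTH = ("elastic", "changeme")      # placeholder

settings = {
    "index.search.slowlog.threshold.query.warn": "2s",
    "index.search.slowlog.threshold.query.info": "0ms",  # "0ms" logs every query at info
}

resp = requests.put(
    f"{ES_URL}/metrics-*/_settings",   # placeholder index pattern
    json=settings,
    auth=AUTH,
    verify=False,
)
resp.raise_for_status()
print(resp.json())
```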
