I'm wondering what the best way is to track queries based on their heap usage.
Context: I've recently had to increase the heap allocation on the coordinating nodes of a cluster (3 coordinating nodes; currently 20GB RAM / 18GB heap, up from 10GB RAM / 8GB heap), as they were frequently hitting circuit breakers. The cluster has become more actively used, so while I would have expected to increase the heap a little, I wouldn't have expected to increase it this much.
I suspect that a few newer queries are using far more heap than I would expect, but I haven't been able to find a reliable way to track queries by heap usage.
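The closest I've found so far is per-node visibility rather than per-query: the breaker stats at least show which breaker is tripping and how close each node is to its limit. A sketch of the requests I'm using from Kibana Dev Tools:

```
# Per-node circuit breaker stats: each breaker's limit, estimated
# size, and how many times it has tripped.
GET _nodes/stats/breaker

# Quick per-node heap overview.
GET _cat/nodes?v&h=name,node.role,heap.percent,heap.current,heap.max
```

That tells me *when* and *where* heap pressure happens, but not *which* query caused it, which is the part I'm stuck on.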
I looked at the slow log, but I don't think it's the right approach: the queries being executed aren't slow, they just (potentially) use a large amount of heap.
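For reference, the slow log settings I experimented with looked roughly like this (the index name and thresholds are illustrative). The problem is that a heap-hungry query returning in, say, 200ms never crosses these time thresholds:

```
# Search slow log thresholds on an index (values are illustrative).
# A fast but heap-heavy query will never show up here.
PUT my-index/_settings
{
  "index.search.slowlog.threshold.query.warn": "5s",
  "index.search.slowlog.threshold.query.info": "2s",
  "index.search.slowlog.threshold.fetch.warn": "1s"
}
```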
The queries all come from Kibana features (Observability rules, SIEM rules), so I don't have much control over, or ability to debug, the actual queries being executed.
Hi @DineshNaik, the queries all come from Kibana rules. The main ones we use are the Metrics rules under Observability and a few Log Threshold rules, also under Observability. We also have SIEM rules (mainly the prebuilt ones), but those haven't really changed since the noticeable uptick in heap usage, which leads me to believe that some of the Metrics or Log Threshold rules are the cause.
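One thing that has helped me narrow down suspects is ranking rule executions by duration in the Kibana event log. It's only a rough proxy for heap (a long execution isn't necessarily a heap-heavy one), and the index pattern and field names below are assumptions based on what I see in my cluster, so they may differ between stack versions:

```
# Rank recent rule executions by duration (event.duration is in
# nanoseconds). NOTE: .kibana-event-log-* and these field names are
# assumptions from my own cluster; verify them on your version.
GET .kibana-event-log-*/_search
{
  "size": 10,
  "query": {
    "bool": {
      "filter": [
        { "term": { "event.provider": "alerting" } },
        { "term": { "event.action": "execute" } },
        { "range": { "@timestamp": { "gte": "now-1h" } } }
      ]
    }
  },
  "sort": [ { "event.duration": "desc" } ]
}
```

In my cluster each hit includes a reference to the rule that executed, which is enough to identify the worst offenders by runtime, if not by heap.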
Regarding OOM kills: so far I have not noticed any, but I have seen the coordinating nodes lock up for several minutes, close to the point of being OOM-killed, as the heap takes a while to free up.
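When a node locks up like that, the only useful signals I've managed to capture in the moment are hot threads and the JVM stats, e.g.:

```
# What the node is busy doing while it's locked up.
GET _nodes/hot_threads?threads=5

# Per-node GC and heap detail (heap_used_percent, collection
# counts and times) to see how long the heap takes to recover.
GET _nodes/stats/jvm
```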
Regarding date ranges, most of the queries only look back over the last ~5 minutes.
The data volume has grown a bit, but not what I'd consider drastic. (Context: before the issue the cluster handled ~50k events/s; it now handles ~65k events/s. That's roughly a 30% increase, so I wouldn't expect the heap requirement to more than double, from 8GB to 18GB.)