How to correlate costly queries with intense garbage collection leading to out of memory

I have an ES cluster that is used by multiple analysts who issue ad-hoc queries to perform data analysis. These analysts often formulate complex queries. Most of the time the cluster is stable, but occasionally one query will drive one or more nodes into a very low-memory state, and those nodes become unresponsive because they are perpetually garbage collecting. Eventually such a node usually throws an OOM exception, but this can take up to an hour.

While I would love to actually prevent these cases from happening altogether (we do have circuit breakers set, but they don't seem to catch all cases), I am immediately interested in being able to determine which query is causing problems. I did try to enable the slow query log, but it does not seem to always log the offending query (I reproduced this by using a known "bad" query).
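For reference, the slow query log was enabled roughly like the following (index name and thresholds here are examples, not my actual values). One possible reason it misses the offending query: slowlog entries are written per shard only after that shard's query or fetch phase completes, so a query that drives a node into GC death may never finish and never be logged.

```
PUT /my-index/_settings
{
  "index.search.slowlog.threshold.query.warn": "10s",
  "index.search.slowlog.threshold.query.info": "5s",
  "index.search.slowlog.threshold.fetch.warn": "1s"
}
```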

Is there any other best practice or logs that can help me easily track down queries that use very large amounts of memory?

Here is some additional information about my cluster:

  • ES version 2.6.4
  • 8 machines with the same configuration: 40 cores, 256GB RAM, 8TB of SSDs in a JBOD configuration, 2 ES node processes per machine
  • each node process is configured with a 30GB JVM heap and serves search, index, and master roles (I should probably add dedicated master nodes at least, but I'm not sure whether that matters for this problem)
  • these machines form a private cluster, not virtual machines at a cloud provider

Thanks in advance for any and all advice!

Hey Joseph, have you looked at disabling OS swap? Try these settings and see if they have a positive impact on the issue.
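A sketch of the swap-related settings I mean, roughly (the exact ES setting name varies by version: `bootstrap.mlockall` on 2.x, `bootstrap.memory_lock` on 5.x+):

```
# OS level: disable swap outright, or make the kernel avoid it
sudo swapoff -a
# and/or persist in /etc/sysctl.conf:
#   vm.swappiness = 1

# elasticsearch.yml (2.x): lock the JVM heap into RAM
bootstrap.mlockall: true
```

Note that memory locking also requires the `memlock` ulimit to be raised for the ES user, otherwise the lock silently fails.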

We have had OS swap disabled during all of these reported incidents.

To answer your question directly...

Using Packetbeat, you can ship all the query requests and responses to an ES index and track them on a time-series graph.
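A minimal sketch of that Packetbeat setup, sniffing HTTP traffic on the ES port and capturing request bodies so the actual query JSON is recorded (the `monitoring-node` host is a placeholder; ideally ship to a separate monitoring cluster so the data survives when a node dies):

```yaml
packetbeat.interfaces.device: any

packetbeat.protocols.http:
  ports: [9200]
  send_request: true                    # capture the request body (the query itself)
  include_body_for: ["application/json"]

output.elasticsearch:
  hosts: ["monitoring-node:9200"]       # placeholder for your monitoring cluster
```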

Metricbeat will allow you to track JVM heap usage over time.
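One way to do that, assuming the Metricbeat `elasticsearch` module with its `node_stats` metricset (again, `monitoring-node` is a placeholder for a separate monitoring cluster):

```yaml
metricbeat.modules:
  - module: elasticsearch
    metricsets: ["node_stats"]          # includes JVM heap used/max per node
    period: 10s
    hosts: ["http://localhost:9200"]

output.elasticsearch:
  hosts: ["monitoring-node:9200"]
```

With both indices in one Kibana dashboard, you can line up heap spikes against the queries that arrived just before them.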

Put both together in a dashboard and you can narrow down the offenders fairly easily.


