How to correlate costly queries with intense garbage collection leading to out of memory

I have an ES cluster that is used by multiple analysts who issue ad-hoc queries to perform data analysis. These analysts often formulate complex queries. Most of the time the cluster is stable, but occasionally one query will drive one or more nodes into a very low-memory state, and those nodes become unresponsive because they are perpetually garbage collecting. Eventually such a node usually throws an OOM exception, but this can take up to an hour.

While I would love to actually prevent these cases from happening altogether (we do have circuit breakers set, but they don't seem to catch all cases), I am immediately interested in being able to determine which query is causing problems. I did try to enable the slow query log, but it does not seem to always log the offending query (I reproduced this by using a known "bad" query).
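For reference, the slow query log was enabled roughly like the following (index name and thresholds here are examples, not my actual values). One possible reason it misses the offending query: slowlog entries are written per shard only after that shard's query or fetch phase completes, so a query that drives a node into GC death may never finish and never be logged.

```
PUT /my-index/_settings
{
  "index.search.slowlog.threshold.query.warn": "10s",
  "index.search.slowlog.threshold.query.info": "5s",
  "index.search.slowlog.threshold.fetch.warn": "1s"
}
```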

Is there any other best practice or logs that can help me easily track down queries that use very large amounts of memory?

Here is some additional information about my cluster:

  • ES version 2.6.4
  • 8 machines with the same configuration: 40 cores, 256GB RAM, 8TB of SSDs in a JBOD configuration, 2 ES node processes per machine
  • each node process is configured with a 30GB JVM heap and serves search, index, and master roles (I should probably add dedicated master nodes at least, but I'm not sure whether that matters for this problem)
  • these machines form a private cluster, not virtual machines at a cloud provider

Thanks in advance for any and all advice!

Hey Joseph, have you looked at disabling OS swap? Try these settings and see if they have a positive impact on the issue.
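A sketch of the swap-related settings I mean, roughly (the exact ES setting name varies by version: `bootstrap.mlockall` on 2.x, `bootstrap.memory_lock` on 5.x+):

```
# OS level: disable swap outright, or make the kernel avoid it
sudo swapoff -a
# and/or persist in /etc/sysctl.conf:
#   vm.swappiness = 1

# elasticsearch.yml (2.x): lock the JVM heap into RAM
bootstrap.mlockall: true
```

Note that memory locking also requires the `memlock` ulimit to be raised for the ES user, otherwise the lock silently fails.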

We have had OS swap disabled during all of these reported incidents.

To answer your question directly...

Using Packetbeat, you can ship all the query requests and responses to an ES index and track them on a time-series graph.
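A minimal sketch of that Packetbeat setup, sniffing HTTP traffic on the ES port and capturing request bodies so the actual query JSON is recorded (the `monitoring-node` host is a placeholder; ideally ship to a separate monitoring cluster so the data survives when a node dies):

```yaml
packetbeat.interfaces.device: any

packetbeat.protocols.http:
  ports: [9200]
  send_request: true                    # capture the request body (the query itself)
  include_body_for: ["application/json"]

output.elasticsearch:
  hosts: ["monitoring-node:9200"]       # placeholder for your monitoring cluster
```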

Metricbeat will allow you to track JVM heap usage over time.
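One way to do that, assuming the Metricbeat `elasticsearch` module with its `node_stats` metricset (again, `monitoring-node` is a placeholder for a separate monitoring cluster):

```yaml
metricbeat.modules:
  - module: elasticsearch
    metricsets: ["node_stats"]          # includes JVM heap used/max per node
    period: 10s
    hosts: ["http://localhost:9200"]

output.elasticsearch:
  hosts: ["monitoring-node:9200"]
```

With both indices in one Kibana dashboard, you can line up heap spikes against the queries that arrived just before them.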

Put both together in a dashboard and you can narrow down the offenders fairly easily.


