I have an ES cluster used by multiple analysts who issue ad-hoc queries for data analysis, and these queries are often complex. Most of the time the cluster is stable, but occasionally a single query drives one or more nodes into a very low memory state; the affected nodes become unresponsive because they are perpetually garbage collecting. Eventually the node usually throws an OutOfMemoryError, but that can take up to an hour.
While I would love to prevent these cases altogether (we do have circuit breakers set, but they don't seem to catch everything), my immediate interest is in identifying which query is causing the problem. I did try enabling the slow query log, but it does not always record the offending query (I verified this by running a known "bad" query).
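For reference, here is roughly how I have the slow log and breakers set up. This is a minimal sketch using Python's `requests` against the REST API; the host, the index name `my-index`, and the specific thresholds and limits are all placeholder values:

```python
# Minimal sketch: enable search slow logging and tighten circuit breakers.
# Host, index name, thresholds, and limits below are placeholders.
import requests

ES = "http://localhost:9200"

# Per-index search slow log thresholds (dynamic index settings).
requests.put(f"{ES}/my-index/_settings", json={
    "index.search.slowlog.threshold.query.warn": "10s",
    "index.search.slowlog.threshold.query.info": "5s",
    "index.search.slowlog.threshold.fetch.warn": "1s",
})

# Cluster-wide circuit breaker limits (dynamic cluster settings).
requests.put(f"{ES}/_cluster/settings", json={
    "transient": {
        "indices.breaker.request.limit": "40%",
        "indices.breaker.fielddata.limit": "30%",
    }
})
```

My understanding is that the slow log is written per shard only after a query/fetch phase completes, which might explain why a query that wedges a node before it finishes never shows up there.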
Are there any other best practices or logs that could help me easily track down queries that use very large amounts of memory?
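One thing I have sketched out (though I would prefer a built-in solution) is a simple watchdog that polls heap usage via the node stats API and dumps hot threads when a node looks to be under memory pressure. Again Python with `requests`; the host, the 85% cutoff, and the 30-second interval are arbitrary placeholders:

```python
# Rough watchdog sketch: poll JVM heap via the node stats API and capture
# hot threads while a pressured node is still responsive enough to answer.
import time
import requests

ES = "http://localhost:9200"
HEAP_PCT_LIMIT = 85  # placeholder threshold

while True:
    stats = requests.get(f"{ES}/_nodes/stats/jvm").json()
    for node_id, node in stats["nodes"].items():
        heap_pct = node["jvm"]["mem"]["heap_used_percent"]
        if heap_pct >= HEAP_PCT_LIMIT:
            # Hot threads come back as plain text stack samples.
            hot = requests.get(f"{ES}/_nodes/{node_id}/hot_threads").text
            print(f"{node['name']} heap at {heap_pct}%:\n{hot}")
    time.sleep(30)
```

The hot threads output at least shows which search code is busy, but it still doesn't map back to the original request body, which is what I'm really after.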
Here is some additional information about my cluster:
- ES version 2.6.4
- 8 machines with the same configuration:
  - 40 cores, 256 GB RAM, 8 TB of SSDs in a JBOD configuration, 2 ES node processes per machine
  - each node process is configured with a 30 GB JVM heap and serves search, data (indexing), and master roles (I should probably at least add dedicated master nodes, sketched after this list, but I'm not sure whether that matters for this problem)
- these machines form a private, on-premises cluster; they are not virtual machines at a cloud provider
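For reference, the dedicated-master change I have in mind would look roughly like this in `elasticsearch.yml`. This is a sketch only, using the standard 2.x-era `node.master`/`node.data` flags; the node count of three masters is a placeholder:

```yaml
# elasticsearch.yml sketch for a dedicated master-eligible node;
# run e.g. three of these with a small heap and no query load.
node.master: true
node.data: false

# quorum of master-eligible nodes: (3 / 2) + 1 = 2, to avoid split brain
discovery.zen.minimum_master_nodes: 2

# the existing 16 data/search node processes would flip the flags:
# node.master: false
# node.data: true
```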
Thanks in advance for any and all advice!