Entire cluster easily disrupted with sizeable geospatial query (OOM)

Oli_McCormack · July 16, 2013, 8:18pm

Hi all,

Recently I've discovered that I can knock our entire cluster offline for a
period of time by executing a reasonably sized geo-query. My suspicion is
that this is not isolated to geo-queries, but that this is just an easy way
to reproduce the problem. I've documented a simple, complete repro case in
this gist https://gist.github.com/olimcc/ee70a6970367b241e100.

Here's what I observe:

1. Issue query to service.
2. Every node in the cluster becomes unresponsive over the next minute or
   so. CPU shoots up and stays at almost full utilization on every machine,
   and JVM memory is also at capacity. My assumption is that the query has
   been sent to all nodes in parallel and is now consuming their resources.
3. This persists for maybe 15-20 mins or more. Many nodes throw OOM.
4. Nodes occasionally rejoin a cluster and re-elect masters (split brain is
   quite common).
5. The only way I've been able to completely resolve this has been to
   manually kill all nodes in the cluster and bring them back one by one.

Note that the issue is triggered by the search alone, not by the results:
there is no data stored in the service. We use a QuadPrefixTree, and my
understanding is that a number of the tree nodes are loaded into memory
before results are retrieved from them, which may be causing this. I've
attempted to estimate the number of nodes that will be loaded and block
queries from my client if the number is too great, but this seems hacky.
I'd love a proper solution.
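
A minimal sketch of the kind of client-side guard described above, not the actual code in use. It assumes the tree covers the whole world (-180..180, -90..90), that each level splits a cell into four (so level n is a 2^n x 2^n grid), and that the query shape is a bounding box. The names estimate_cells and is_query_allowed, and the max_cells threshold, are hypothetical. The count is only a rough proxy: the real tree traversal may touch a different number of cells.

```python
import math

# Assumed world bounds for the quad tree (an assumption, not read from ES).
WORLD_MIN_LON, WORLD_MAX_LON = -180.0, 180.0
WORLD_MIN_LAT, WORLD_MAX_LAT = -90.0, 90.0


def estimate_cells(min_lon, min_lat, max_lon, max_lat, levels):
    """Count the grid cells at depth `levels` that a bounding box touches."""
    per_axis = 2 ** levels  # each level halves a cell along both axes
    cell_w = (WORLD_MAX_LON - WORLD_MIN_LON) / per_axis
    cell_h = (WORLD_MAX_LAT - WORLD_MIN_LAT) / per_axis

    def index(value, origin, size):
        # Clamp so that the world's upper edge falls into the last cell.
        raw = int(math.floor((value - origin) / size))
        return min(per_axis - 1, max(0, raw))

    cols = (index(max_lon, WORLD_MIN_LON, cell_w)
            - index(min_lon, WORLD_MIN_LON, cell_w) + 1)
    rows = (index(max_lat, WORLD_MIN_LAT, cell_h)
            - index(min_lat, WORLD_MIN_LAT, cell_h) + 1)
    return cols * rows


def is_query_allowed(bbox, levels, max_cells=10000):
    """Reject a query whose estimated cell count exceeds an arbitrary cap."""
    min_lon, min_lat, max_lon, max_lat = bbox
    return estimate_cells(min_lon, min_lat, max_lon, max_lat, levels) <= max_cells
```

With these assumptions, a whole-world box at 10 levels is rejected (1024 x 1024 cells), while a small box that fits inside one cell passes. The cap would have to be tuned against the heap actually available per node.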

I'm primarily interested in preventing this from happening. I would be
really interested to hear about any ways I can do this without increasing
allocated memory. I am not concerned about recovering from split brain at
the moment (I think that's a separate issue from what causes it).

I'm wondering:

1. Is there any way Elasticsearch itself can stop this from happening?
2. Or do I need to, in my client, inspect every query before I execute
   it, to ensure it's not too large?
3. If I assume a bad query takes a long time, can I or ES kill the query
   after a set period? Research in the docs/mailing list suggests I can't
   do this.

Happy to provide other information that would be useful here, and work
through any suggestions people have.
Thanks very much,
Oli

Cluster information
Number of nodes: 5
Java: Sun Java HotSpot(TM) 64-Bit Server VM, 1.6.0_37
Heap allocation: 6gb (on a 7.5gb box - this doesn't match the 50% rule
often mentioned, I can change this if it's related)
Shards: 20, approximately 10gb each. 2 replicas.
ES Version: 0.20.4
Geo PrefixTree in use: QuadPrefixTree

