Cluster sometimes dies due to excessive GC activity - query size

dko · September 3, 2015, 12:53pm

Our cluster sometimes dies due to excessive GC activity when we specify the "size" parameter of our queries to be too high: we have analysed our cache usage (fielddata is a minimum as we have doc_values activated and none of our fields are analyzed) and things look reasonable but we have seen that when SIZE is too high, we place a demand on Elastic to reserve a chunk of contiguous memory to hold the returning data structure that causes a lot of garbage collecting.

How can we estimate the size of this data structure?
Is it directly related to document size?
How can we avoid that the cluster dies because of a query?
How can we know that due to a to small query size we missed data records (hits)?

Thanks
Daniel

Some background information:

I noticed that there is a similar but old discussion "Optimizing queries for a 5 node cluster with 250 M documents (causes OutOfMemory exceptions and GC pauses)"

In some bad situations we see gc runs like this:
[2015-08-31 10:06:31,055][WARN ][monitor.jvm ] [node02] [gc][young][1457085][318412] duration [4.8m], collections 1/[5.1m], total [4.8m]/[4.9h], memory [27.5gb]->[12.8gb]/[30.9gb], all_pools {[young] [286.8mb]->[66.1mb]/[665.6mb]}{[survivor] [83.1mb]->[0b]/[83.1mb]}{[old] [27.2gb]->[12.7gb]/[30.1gb]}

Query size: raised from 200k to 2mio (lead to problems)
Average doc size: 1.2KB (mainly numerical values)

9 Data Nodes
2 Client Nodes (queries are going against these nodes)
Heap: 31GB (per node)
RAM: 126GB (per node)
SSD: 6TB (per node)
ES: 1.6.2
JVM: 1.8.0_51

Data: 12.5 TB
Documents: 8 Bil
Indices: 73
Shards: 834

s1monw · September 4, 2015, 8:38pm

this is a way too high value. Elasticsearch is a top N retrieval engine you should look at the top N where N is small like 10 or 20. 100 is fine too. But 200k or 2M is not what this system is designed for. If you still need to look at all docs you should consider scan/scroll search that is designed to be memory efficient since you don't materialize all the docs in memory at once.

dko · September 7, 2015, 8:56am

Thanks Simon for your answer.
We are already using scan/scroll in some scenarios where the Java client is used to build aggregations. But we hesitate using this method with our PHP client for the fronted where query time is important. Nevertheless we will look into this and return with our results.

For the meantime, is there any option to limit the size of a query globally so that nothing can break the cluster?

s1monw · September 7, 2015, 1:29pm

we are working on that here https://github.com/elastic/elasticsearch/pull/13188

I don't understand, what's the performance benefit / drawback here? You are killing your cluster so how bad could it be to prevent that?

Topic		Replies	Views
Optimizing queries for a 5 node cluster with 250 M documents (causes OutOfMemory exceptions and GC pauses) Elasticsearch	10	1566	July 6, 2017
How to prevent ES from garbage collecting during several minutes Elasticsearch	7	2026	July 6, 2017
Help with OutOfMemory and some other memory concerns Elasticsearch	7	406	July 6, 2017
Problem with heap space overusage Elasticsearch	7	1888	July 6, 2017
Java.lang.OutOfMemoryError: Java heap space Elasticsearch	25	3485	July 6, 2017

Cluster sometimes dies due to excessive GC activity - query size

Related topics