Our cluster sometimes dies from excessive GC activity when we set the "size" parameter of our queries too high. We have analysed our cache usage (fielddata usage is minimal since we have doc_values enabled and none of our fields are analyzed) and things look reasonable, but we have seen that when "size" is too large, Elasticsearch has to reserve a large chunk of contiguous memory to hold the returned data structure, and that causes a lot of garbage collection.
- How can we estimate the size of this data structure?
- Is it directly related to document size?
- How can we prevent the cluster from dying because of a single query?
- How can we tell that we missed data records (hits) because the query size was too small? (See the sketch below.)
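To make this concrete, here is a rough sketch of the kind of request we send (host, index and query are placeholders, not our real ones); the comparison at the end is the kind of check I imagine could tell us whether hits were missed:

```python
# Rough sketch only: host, index and query are placeholders.
import requests

ES_URL = "http://client-node:9200"   # one of the two client nodes
INDEX = "some-index-*"               # placeholder index pattern
REQUESTED_SIZE = 2000000             # raised from 200k; this is where the GC problems start

body = {
    "size": REQUESTED_SIZE,
    "query": {"match_all": {}},      # the real query is more selective
}

resp = requests.post(f"{ES_URL}/{INDEX}/_search", json=body).json()

total = resp["hits"]["total"]        # total matching docs (an integer in ES 1.x)
returned = len(resp["hits"]["hits"]) # docs actually returned in this response

if returned < total:
    print(f"size too small: missed {total - returned} matching documents")
```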
Thanks
Daniel
Some background information:
I noticed that there is a similar but older discussion, "Optimizing queries for a 5 node cluster with 250 M documents (causes OutOfMemory exceptions and GC pauses)".
In bad situations we see GC runs like this:
[2015-08-31 10:06:31,055][WARN ][monitor.jvm ] [node02] [gc][young][1457085][318412] duration [4.8m], collections 1/[5.1m], total [4.8m]/[4.9h], memory [27.5gb]->[12.8gb]/[30.9gb], all_pools {[young] [286.8mb]->[66.1mb]/[665.6mb]}{[survivor] [83.1mb]->[0b]/[83.1mb]}{[old] [27.2gb]->[12.7gb]/[30.1gb]}
Query size: raised from 200k to 2 million (this led to the problems)
Average doc size: 1.2KB (mainly numerical values)
9 Data Nodes
2 Client Nodes (queries are going against these nodes)
Heap: 31GB (per node)
RAM: 126GB (per node)
SSD: 6TB (per node)
ES: 1.6.2
JVM: 1.8.0_51
Data: 12.5 TB
Documents: 8 billion
Indices: 73
Shards: 834
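As a back-of-the-envelope estimate based on the numbers above, assuming the client node has to materialise roughly size × average document size for a single response (JSON/object overhead not included, so the real footprint is presumably quite a bit larger):

```python
# Very rough estimate only; ignores per-hit object overhead and the fact that
# the coordinating client node also has to merge results from all shards.
requested_size = 2_000_000      # hits requested per query (the problematic size)
avg_doc_kb = 1.2                # average document size from the stats above

estimated_gb = requested_size * avg_doc_kb / (1024 * 1024)
print(f"~{estimated_gb:.1f} GB of hit data per response")   # ~2.3 GB, against a 31 GB heap
```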