Trying to figure out who is taking up all my memory


(Siddartha Guthikonda) #1

A few nodes in my Elasticsearch cluster go into full GC and essentially make the entire cluster unresponsive. I am trying to figure out what the culprit is.

Initially I ran with a 30 GB heap, with 20% allotted to the filter cache, 5% to the query cache, and 2% to fielddata. All my queries are aggregations, and I use doc values for every aggregation field.
I am trying to make sense of the jmap output and the cache statistics.

Here are the first few lines of my jmap output:
num #instances #bytes class name

1: 3745823 24201190056 [J
2: 13868630 17276247688 [B
3: 3198414 639682800 org.apache.lucene.codecs.blocktree.SegmentTermsEnumFrame
4: 5337158 469669904 org.apache.lucene.util.fst.FST$Arc
5: 10957748 438309920 org.apache.lucene.store.ByteArrayDataInput

and _nodes/stats for the same node returns this:

indices: {
docs: {
count: 17387611095,
deleted: 8063429134
},
store: {
size_in_bytes: 1271363405591,
throttle_time_in_millis: 0
},
indexing: {
index_total: 125957602,
index_time_in_millis: 889829487,
index_current: 4720,
delete_total: 0,
delete_time_in_millis: 0,
delete_current: 0,
noop_update_total: 0,
is_throttled: false,
throttle_time_in_millis: 218685
},
get: {
total: 75982247,
time_in_millis: 74856475,
exists_total: 71321749,
exists_time_in_millis: 70579494,
missing_total: 4660498,
missing_time_in_millis: 4276981,
current: 5
},
search: {
open_contexts: 128,
query_total: 40183,
query_time_in_millis: 132559014,
query_current: 49,
fetch_total: 17,
fetch_time_in_millis: 0,
fetch_current: 0
},
merges: {
current: 14,
current_docs: 35714530,
current_size_in_bytes: 1731561434,
total: 148128,
total_time_in_millis: 137983466,
total_docs: 34367327475,
total_size_in_bytes: 1874235885366
},
refresh: {
total: 615738,
total_time_in_millis: 56488556
},
flush: {
total: 36128,
total_time_in_millis: 16498523
},
warmer: {
current: 0,
total: 1403183,
total_time_in_millis: 110024
},
filter_cache: {
memory_size_in_bytes: 9376302840,
evictions: 596917
},
id_cache: {
memory_size_in_bytes: 0
},
fielddata: {
memory_size_in_bytes: 13895896,
evictions: 0
},
percolate: {
total: 0,
time_in_millis: 0,
current: 0,
memory_size_in_bytes: -1,
memory_size: "-1b",
queries: 0
},
completion: {
size_in_bytes: 0
},
segments: {
count: 4008,
memory_in_bytes: 977803492,
index_writer_memory_in_bytes: 446534626,
index_writer_max_memory_in_bytes: 4801386167,
version_map_memory_in_bytes: 4092564,
fixed_bit_set_memory_in_bytes: 2814178880
},
translog: {
operations: 111528,
size_in_bytes: 17
},
suggest: {
total: 0,
time_in_millis: 0,
current: 0
},
query_cache: {
memory_size_in_bytes: 0,
evictions: 0,
hit_count: 0,
miss_count: 0
},
recovery: {
current_as_source: 1,
current_as_target: 0,
throttle_time_in_millis: 6541316
}
},
os: {
timestamp: 1478028318162
},
process: {
timestamp: 1478028318162,
open_file_descriptors: 9092
},
jvm: {
timestamp: 1478028318179,
uptime_in_millis: 336709118,
mem: {
heap_used_in_bytes: 41667403000,
heap_used_percent: 86,
heap_committed_in_bytes: 48030547968,
heap_max_in_bytes: 48030547968,
non_heap_used_in_bytes: 129473208,
non_heap_committed_in_bytes: 131903488,
pools: {
young: {
used_in_bytes: 111476760,
max_in_bytes: 2303262720,
peak_used_in_bytes: 2303262720,
peak_max_in_bytes: 2303262720
},
survivor: {
used_in_bytes: 287834112,
max_in_bytes: 287834112,
peak_used_in_bytes: 287834112,
peak_max_in_bytes: 287834112
},
old: {
used_in_bytes: 41268094264,
max_in_bytes: 45439451136,
peak_used_in_bytes: 45439451112,
peak_max_in_bytes: 45439451136
}
}
},
threads: {
count: 682,
peak_count: 6588
}
},
buffer_pools: {
direct: {
count: 3119,
used_in_bytes: 156466600,
total_capacity_in_bytes: 156466600
},
mapped: {
count: 1101,
used_in_bytes: 345497122127,
total_capacity_in_bytes: 345497122127
}
}
}

So my filter cache, fielddata, and segments memory combined do not even come to 50% of the heap, yet the stats show about 40 GB in use. I need help understanding why this is the case.
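
As a rough check on that comparison, here is a minimal Python sketch that adds up the heap-related counters from the node stats above and compares them with heap_used_in_bytes. The values are copied from the output; the choice of which counters to sum, and the assumption that they do not overlap, are mine and may not match how Elasticsearch accounts for them internally.

```python
# Heap-related counters copied from the _nodes/stats output above.
# Assumption: these counters are disjoint (none is already included in another).
tracked = {
    "filter_cache.memory_size_in_bytes": 9376302840,
    "fielddata.memory_size_in_bytes": 13895896,
    "query_cache.memory_size_in_bytes": 0,
    "id_cache.memory_size_in_bytes": 0,
    "segments.memory_in_bytes": 977803492,
    "segments.index_writer_memory_in_bytes": 446534626,
    "segments.version_map_memory_in_bytes": 4092564,
    "segments.fixed_bit_set_memory_in_bytes": 2814178880,
}

heap_used = 41667403000   # jvm.mem.heap_used_in_bytes
heap_max = 48030547968    # jvm.mem.heap_max_in_bytes

tracked_total = sum(tracked.values())
unaccounted = heap_used - tracked_total


def gib(n):
    """Format a byte count as GiB with two decimals."""
    return f"{n / 2**30:.2f} GiB"


print("tracked by stats :", gib(tracked_total))
print("heap used        :", gib(heap_used))
print("unaccounted      :", gib(unaccounted))
print(f"tracked / used   : {tracked_total / heap_used:.1%}")
print(f"tracked / max    : {tracked_total / heap_max:.1%}")
```

Under those assumptions the counters cover roughly a third of the used heap, so most of it is held by something these stats do not report.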

Also, can someone please help me understand the first two lines of the jmap output and what they correspond to in ES?
1: 3745823 24201190056 [J
2: 13868630 17276247688 [B
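
For reading those lines: `[J` and `[B` are the JVM's internal names for `long[]` and `byte[]`. The sketch below is a small Python helper (the function names are my own, not part of jmap) that parses histogram lines in the format shown above and rewrites array descriptors into readable type names. It only identifies the array type; the histogram alone does not say which Elasticsearch or Lucene structure owns those arrays, which would need a heap dump analysis.

```python
# JVM descriptor letters for primitive array element types.
PRIMITIVES = {
    "Z": "boolean", "B": "byte", "C": "char", "S": "short",
    "I": "int", "J": "long", "F": "float", "D": "double",
}


def readable_class_name(name):
    """Turn a JVM array descriptor such as '[J' into 'long[]'."""
    depth = len(name) - len(name.lstrip("["))
    if depth == 0:
        return name  # ordinary class name, leave untouched
    element = name[depth:]
    if element.startswith("L") and element.endswith(";"):
        base = element[1:-1]  # object array, e.g. [Ljava.lang.String;
    else:
        base = PRIMITIVES.get(element, element)
    return base + "[]" * depth


def parse_histo_line(line):
    """Parse one 'num: #instances #bytes class name' histogram row."""
    rank, instances, size, class_name = line.split(None, 3)
    return {
        "rank": int(rank.rstrip(":")),
        "instances": int(instances),
        "bytes": int(size),
        "class": readable_class_name(class_name.strip()),
    }


# The first two rows from the jmap output in this post.
HISTO = """\
1: 3745823 24201190056 [J
2: 13868630 17276247688 [B
"""

rows = [parse_histo_line(l) for l in HISTO.splitlines() if l.strip()]
for row in rows:
    print(f"{row['class']:<8} {row['instances']:>10} instances "
          f"{row['bytes'] / 2**30:8.2f} GiB")
```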

Thanks
Sid


(Mark Walkom) #2

What version?
What JVM?
What OS?
Do you have Marvel installed to see what is happening?


(Siddartha Guthikonda) #3

@warkolm
Elasticsearch - 1.7.2
Java - 1.8
OS - RHEL 6

I don't have Marvel installed, but the stats show that my caches and segments memory together take no more than 15 GB.

My current setup:
19 boxes: 24 cores - 256 GB RAM - SSDs
I initially gave Elasticsearch a 30 GB heap, then raised it to 45 GB after I started seeing issues (full GCs).
Total store size for all indices is around 6 TB.
I maintain one index per tenant (around 300), with a different number of shards per index.

Total Shards: 6114 (including one replica)
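
To put the setup in per-node terms, here is a back-of-the-envelope sketch in Python. It assumes shards are spread evenly across the 19 boxes, which the cluster does not guarantee, and it uses heap_max_in_bytes from the node stats posted earlier as the heap size.

```python
# Figures taken from this thread.
total_shards = 6114        # including one replica
nodes = 19
heap_max = 48030547968     # jvm.mem.heap_max_in_bytes from _nodes/stats

# Assumption: shards are evenly distributed across nodes.
shards_per_node = total_shards / nodes
heap_per_shard = heap_max / shards_per_node

print(f"shards per node : {shards_per_node:.0f}")
print(f"heap per shard  : {heap_per_shard / 2**20:.0f} MiB")
```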

