Hi,
Hoping someone can point me in the right direction with a JVM heap / garbage collection issue we're hitting when executing search queries. Here's what I know so far:
Use case
- Performing queries against an index populated with documents (Word, Excel, PDF, email, etc.)
- Elasticsearch documents contain properties describing the source document, plus a "content" field holding its extracted text (so that field can get quite large, depending on the document)
- Queries are performed ad hoc by users (currently at a very low rate) and are free-text searches, i.e. find all documents containing certain words or phrases
- Indexing rate per index is rather low, e.g. ~5,000 documents an hour for a typical index
Hardware/Elastic Setup
- VMs running in Azure
- 3 master, 3 client and 3 data nodes
- Data nodes: 4 vCPUs, 14 GB RAM
- 7 GB allocated to the JVM heap, swapping disabled, memory locked (rough config sketch below)
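The relevant heap/memory settings look roughly like this (a sketch of our data-node config, not a verbatim copy):

# jvm.options
-Xms7g
-Xmx7g

# elasticsearch.yml
bootstrap.memory_lock: true

# plus swap disabled at the OS level on the VMs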
Index Configuration
- One index per tenant (~30 so far): 10 primary shards, 1 replica, 30s refresh interval
- Index size ranges from 1 GB to 400 GB (I'm currently testing against a 50 GB index, which is the largest on the cluster that hosts it)
- 2 document types with ~15 properties each; most are keyword, text, boolean, long or double, plus one nested property (rough mapping sketch after this list)
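To make that concrete, the per-tenant index settings and mapping look roughly like the following (only one of the two types shown; field names other than "content", "accessControlLevels" and "uniqueId" are illustrative, not our real property names):

PUT /tenant-index-example
{
  "settings": {
    "number_of_shards": 10,
    "number_of_replicas": 1,
    "refresh_interval": "30s"
  },
  "mappings": {
    "contentobject": {
      "properties": {
        "uniqueId":            { "type": "keyword" },
        "accessControlLevels": { "type": "keyword" },
        "title":               { "type": "text" },
        "isArchived":          { "type": "boolean" },
        "sizeInBytes":         { "type": "long" },
        "relevanceScore":      { "type": "double" },
        "attachments":         { "type": "nested" },
        "content":             { "type": "text" }
      }
    }
  }
}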
Issue
- When performing certain queries (ones that return a limited number of results, i.e. the terms are infrequent in the target index), the query takes 15-60+ seconds
- In these cases the data nodes typically log that they are performing garbage collection, which is obviously affecting query performance
- In the slow query log, the fetch phase for certain shards ranges from ~200 ms to 12+ seconds, which lines up with the GC entries in the node logs
- We have had occasions where OOM exceptions are thrown
- Mostly, Kibana shows a massive drop in heap usage when GC runs
Example
- Below is an example slow-log entry for one of these queries
- The "accessControlLevels" terms filter enforces security requirements and contains multiple entries (27 in this case) following the patterns below:
"aduser:726877fb-ac7a-48fc-ac90-715a1a558f22"
"adgroup:5eff07e9-388f-4219-bd6e-dc746b04fcd2"
"email:someone@company.com"
[2018-10-04T20:41:54,177][WARN ][index.search.slowlog.fetch] [datanode-2] [fd23a7d3-362d-406b-82ba-7f7155734ab6_1002][2] took[12.7s], took_millis[12783], types[contentobject], stats[], search_type[QUERY_THEN_FETCH], total_shards[10], source[{"from":0,"size":25,"query":{"bool":{"must":[{"query_string":{"query":"abc def","fields":[],"use_dis_max":true,"tie_breaker":0.0,"default_operator":"and","auto_generate_phrase_queries":false,"max_determinized_states":10000,"enable_position_increments":true,"fuzziness":"AUTO","fuzzy_prefix_length":0,"fuzzy_max_expansions":50,"phrase_slop":0,"escape":false,"split_on_whitespace":true,"boost":1.0}}],"filter":[{"terms":{"accessControlLevels":[*see explanation*],"boost":1.0}}],"disable_coord":false,"adjust_pure_negative":true,"boost":1.0}},"_source":{"includes":["*"],"excludes":["content"]},"sort":[{"uniqueId":{"order":"asc"}}],"highlight":{"require_field_match":false,"fields":{"content":{"no_match_size":200}}}}],
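For readability, the source of that query is roughly the following (pretty-printed; most of the query_string parameters and the 27-entry terms list are elided):

{
  "from": 0,
  "size": 25,
  "query": {
    "bool": {
      "must": [
        { "query_string": { "query": "abc def", "default_operator": "and", "fuzziness": "AUTO" } }
      ],
      "filter": [
        { "terms": { "accessControlLevels": [ "...27 entries, see patterns above..." ] } }
      ]
    }
  },
  "_source": { "includes": [ "*" ], "excludes": [ "content" ] },
  "sort": [ { "uniqueId": { "order": "asc" } } ],
  "highlight": {
    "require_field_match": false,
    "fields": { "content": { "no_match_size": 200 } }
  }
}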
- Example node log GC entries below:
[2018-10-04T20:17:11,242][INFO ][o.e.m.j.JvmGcMonitorService] [datanode-2] [gc][old][5696286][1085] duration [8.3s], collections [1]/[8.7s], total [8.3s]/[22.9m], memory [6.4gb]->[2.7gb]/[6.8gb], all_pools {[young] [361.2kb]->[7.1mb]/[266.2mb]}{[survivor] [33.2mb]->[0b]/[33.2mb]}{[old] [6.4gb]->[2.7gb]/[6.5gb]}
[2018-10-04T20:17:11,242][WARN ][o.e.m.j.JvmGcMonitorService] [datanode-2] [gc][5696286] overhead, spent [8.3s] collecting in the last [8.7s]
[2018-10-04T20:17:12,242][INFO ][o.e.m.j.JvmGcMonitorService] [datanode-2] [gc][5696287] overhead, spent [307ms] collecting in the last [1s]
[2018-10-04T20:41:42,897][WARN ][o.e.m.j.JvmGcMonitorService] [datanode-2] [gc][5697757] overhead, spent [753ms] collecting in the last [1.3s]
[2018-10-04T20:41:43,897][INFO ][o.e.m.j.JvmGcMonitorService] [datanode-2] [gc][5697758] overhead, spent [430ms] collecting in the last [1s]
Questions I'm hoping to answer:
- From the data gathered so far I'm assuming that I need to increase the heap size to reduce the garbage collection. Is this correct?
- To improve query performance, are there particular caches I should increase, given that the same query may only be run a couple of times to locate a specific document and then never again, or not for hours/days? (If it helps, I can pull heap and cache stats with the calls shown after this list.)
- Given that the user issuing the query is most likely looking for terms in the body ("content") of the documents, are there any settings I should consider to improve query performance from that perspective?
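In case numbers would help with the heap/cache questions, these are the standard stats APIs I can pull usage from (happy to post output if useful):

GET _cat/nodes?v&h=name,heap.percent,heap.max,ram.percent
GET _nodes/stats/jvm
GET _nodes/stats/indices/query_cache,request_cache,fielddata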
Many thanks in advance for any ideas/suggestions/comments.
Adam.