Hi,
Hoping someone can point me in the right direction with a JVM heap / garbage collection issue we're hitting when executing search queries. Here's what I know so far:
Use case
- Performing queries against an index populated with documents (Word, Excel, PDF, email, etc.)
- Elasticsearch documents contain properties describing the source document, plus a "content" field holding its extracted text (so that field can get quite large, depending on the document)
- Queries are performed ad hoc by users (currently at a very low rate) and are free-text searches, i.e. find all documents containing certain words or phrases
- Indexing rate per index is rather low, e.g. ~5,000 documents an hour for a typical index
Hardware/Elastic Setup
- VMs running in Azure
- 3 master, 3 client and 3 data nodes
- Data nodes: 4 vCPUs, 14 GB RAM
- 7 GB allocated to the JVM heap, swapping disabled, memory locked (rough config sketch below)
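The relevant heap/memory settings look roughly like this (a sketch of our data-node config, not a verbatim copy):

# jvm.options
-Xms7g
-Xmx7g

# elasticsearch.yml
bootstrap.memory_lock: true

# plus swap disabled at the OS level on the VMs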
Index Configuration
- One index per tenant (~30 so far): 10 primary shards, 1 replica, 30s refresh interval
- Index size ranges from 1 GB to 400 GB (I'm currently testing against a 50 GB index, which is the largest on the cluster that hosts it)
- 2 document types with ~15 properties each; most are keyword, text, boolean, long or double, plus one nested property (rough mapping sketch after this list)
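To make that concrete, the per-tenant index settings and mapping look roughly like the following (only one of the two types shown; field names other than "content", "accessControlLevels" and "uniqueId" are illustrative, not our real property names):

PUT /tenant-index-example
{
  "settings": {
    "number_of_shards": 10,
    "number_of_replicas": 1,
    "refresh_interval": "30s"
  },
  "mappings": {
    "contentobject": {
      "properties": {
        "uniqueId":            { "type": "keyword" },
        "accessControlLevels": { "type": "keyword" },
        "title":               { "type": "text" },
        "isArchived":          { "type": "boolean" },
        "sizeInBytes":         { "type": "long" },
        "relevanceScore":      { "type": "double" },
        "attachments":         { "type": "nested" },
        "content":             { "type": "text" }
      }
    }
  }
}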
Issue
- When performing certain queries (ones that return a limited number of results, i.e. the terms are infrequent in the target index), the query takes 15-60+ seconds
- In these cases the data nodes typically log that they are performing garbage collection, which is obviously affecting query performance
- In the slow query log, the fetch phase for certain shards ranges from ~200 ms to 12+ seconds, which lines up with the GC entries in the node logs
- We have had occasions where OOM exceptions are thrown
- Mostly, Kibana shows a massive drop in heap usage when GC runs
Example
- Below is an example slow-log entry for one of these queries
- The "accessControlLevels" terms filter enforces security requirements and contains multiple entries (27 in this case) following the patterns below:
"aduser:726877fb-ac7a-48fc-ac90-715a1a558f22"
"adgroup:5eff07e9-388f-4219-bd6e-dc746b04fcd2"
"email:someone@company.com"
[2018-10-04T20:41:54,177][WARN ][index.search.slowlog.fetch] [datanode-2] [fd23a7d3-362d-406b-82ba-7f7155734ab6_1002][2] took[12.7s], took_millis[12783], types[contentobject], stats[], search_type[QUERY_THEN_FETCH], total_shards[10], source[{"from":0,"size":25,"query":{"bool":{"must":[{"query_string":{"query":"abc def","fields":[],"use_dis_max":true,"tie_breaker":0.0,"default_operator":"and","auto_generate_phrase_queries":false,"max_determinized_states":10000,"enable_position_increments":true,"fuzziness":"AUTO","fuzzy_prefix_length":0,"fuzzy_max_expansions":50,"phrase_slop":0,"escape":false,"split_on_whitespace":true,"boost":1.0}}],"filter":[{"terms":{"accessControlLevels":[*see explanation*],"boost":1.0}}],"disable_coord":false,"adjust_pure_negative":true,"boost":1.0}},"_source":{"includes":["*"],"excludes":["content"]},"sort":[{"uniqueId":{"order":"asc"}}],"highlight":{"require_field_match":false,"fields":{"content":{"no_match_size":200}}}}],
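For readability, the source of that query is roughly the following (pretty-printed; most of the query_string parameters and the 27-entry terms list are elided):

{
  "from": 0,
  "size": 25,
  "query": {
    "bool": {
      "must": [
        { "query_string": { "query": "abc def", "default_operator": "and", "fuzziness": "AUTO" } }
      ],
      "filter": [
        { "terms": { "accessControlLevels": [ "...27 entries, see patterns above..." ] } }
      ]
    }
  },
  "_source": { "includes": [ "*" ], "excludes": [ "content" ] },
  "sort": [ { "uniqueId": { "order": "asc" } } ],
  "highlight": {
    "require_field_match": false,
    "fields": { "content": { "no_match_size": 200 } }
  }
}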
- Example node log GC entries below:
[2018-10-04T20:17:11,242][INFO ][o.e.m.j.JvmGcMonitorService] [datanode-2] [gc][old][5696286][1085] duration [8.3s], collections [1]/[8.7s], total [8.3s]/[22.9m], memory [6.4gb]->[2.7gb]/[6.8gb], all_pools {[young] [361.2kb]->[7.1mb]/[266.2mb]}{[survivor] [33.2mb]->[0b]/[33.2mb]}{[old] [6.4gb]->[2.7gb]/[6.5gb]}
[2018-10-04T20:17:11,242][WARN ][o.e.m.j.JvmGcMonitorService] [datanode-2] [gc][5696286] overhead, spent [8.3s] collecting in the last [8.7s]
[2018-10-04T20:17:12,242][INFO ][o.e.m.j.JvmGcMonitorService] [datanode-2] [gc][5696287] overhead, spent [307ms] collecting in the last [1s]
[2018-10-04T20:41:42,897][WARN ][o.e.m.j.JvmGcMonitorService] [datanode-2] [gc][5697757] overhead, spent [753ms] collecting in the last [1.3s]
[2018-10-04T20:41:43,897][INFO ][o.e.m.j.JvmGcMonitorService] [datanode-2] [gc][5697758] overhead, spent [430ms] collecting in the last [1s]
Questions I'm hoping to answer:
- From the data gathered so far I'm assuming that I need to increase the heap size to reduce the garbage collection. Is this correct?
- To improve query performance, are there particular caches I should increase, given that the same query may only be run a couple of times to locate a specific document and then never again, or not for hours/days? (If it helps, I can pull heap and cache stats with the calls shown after this list.)
- Given that the user issuing the query is most likely looking for terms in the body ("content") of the documents, are there any settings I should consider to improve query performance from that perspective?
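In case numbers would help with the heap/cache questions, these are the standard stats APIs I can pull usage from (happy to post output if useful):

GET _cat/nodes?v&h=name,heap.percent,heap.max,ram.percent
GET _nodes/stats/jvm
GET _nodes/stats/indices/query_cache,request_cache,fielddata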
Many thanks in advance for any ideas/suggestions/comments.
Adam.