5.5.0 - JVM Heap and Garbage Collection issues when executing queries


#1

Hi,

Hoping someone can point me in the right direction with an issue related to JVM heap usage and garbage collection when executing search queries. Here's what I know so far:

Use case

  • Performing queries against an index which is populated with documents (Word, Excel, PDF, email, etc.)
  • Elastic documents contain properties describing the document, plus a "content" field holding the document's extracted text, so they can get quite large depending on the document (a hypothetical example document is sketched after this list)
  • Queries are performed ad hoc by users (currently at a very low rate) and are free-text searches, i.e. find all documents containing certain words or phrases
  • The indexing rate for each index is rather low, e.g. ~5,000 documents an hour for one example index
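
To make the document shape concrete, below is a hypothetical example of one of these Elasticsearch documents (5.x index/type/id addressing). Apart from "content", "uniqueId", "accessControlLevels" and the contentobject type, which appear in the mapping and query discussed later, the index name and field names are purely illustrative:

    PUT tenant-index/contentobject/doc-000123
    {
      "uniqueId": "doc-000123",
      "title": "Quarterly report",
      "fileType": "docx",
      "createdDate": "2018-09-28T09:15:00Z",
      "sizeInBytes": 482133,
      "isArchived": false,
      "accessControlLevels": [
        "aduser:726877fb-ac7a-48fc-ac90-715a1a558f22",
        "adgroup:5eff07e9-388f-4219-bd6e-dc746b04fcd2",
        "email:someone@company.com"
      ],
      "content": "... full extracted text of the document, which can be very large ..."
    }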

Hardware/Elastic Setup

  • VMs running in Azure
  • 3 master, 3 client and 3 data nodes
  • Data nodes - 4 vCPU, 14 GB RAM
  • 7 GB of RAM allocated to the JVM heap, swapping disabled, memory locked (the relevant settings are sketched below)
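
For reference, a minimal sketch of how that heap and memory-lock configuration is expressed on a 5.x node, assuming the standard jvm.options and elasticsearch.yml files:

    # jvm.options - fixed 7 GB heap (min and max set to the same value)
    -Xms7g
    -Xmx7g

    # elasticsearch.yml - lock the process memory so the heap cannot be swapped out
    bootstrap.memory_lock: true

Whether the lock actually took effect can be checked with GET _nodes?filter_path=**.mlockall.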

Index Configuration

  • Index per tenant (~30 so far) - 10 primary shards, 1 replica, 30s refresh interval (an index-creation sketch follows this list)
  • Index size ranges from 1 GB to 400 GB (I'm currently testing against a 50 GB index, which is the largest on the specific cluster it is hosted on)
  • 2 document types with ~15 properties each; most are keyword, text, boolean, long or double, plus one nested property
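
A sketch of how an index with those settings and a cut-down version of that mapping might be created on 5.x. Only one of the two document types is shown, and the index name and most field names are illustrative rather than our exact mapping, but contentobject, content, uniqueId and accessControlLevels match what appears in the example query below:

    PUT tenant-index
    {
      "settings": {
        "number_of_shards": 10,
        "number_of_replicas": 1,
        "refresh_interval": "30s"
      },
      "mappings": {
        "contentobject": {
          "properties": {
            "uniqueId":            { "type": "keyword" },
            "title":               { "type": "text" },
            "createdDate":         { "type": "date" },
            "sizeInBytes":         { "type": "long" },
            "isArchived":          { "type": "boolean" },
            "accessControlLevels": { "type": "keyword" },
            "attachments":         { "type": "nested" },
            "content":             { "type": "text" }
          }
        }
      }
    }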

Issue

  • When performing certain queries (ones that return a limited number of results, i.e. the terms are infrequent in the target index), the query can take 15-60+ seconds
  • In these cases, the data nodes typically log that they are performing garbage collection, which obviously affects query performance
  • In the slow query log, the fetch phase for certain shards ranges from ~200ms to 12+ seconds, which is reflected in the nodes' GC logging (the slow-log threshold settings involved are sketched after this list)
  • We have had occasions where OOM exceptions were thrown
  • Mostly, Kibana shows a massive drop in heap usage due to GC
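
For anyone not familiar with the slow log: entries like the one below are produced by per-index threshold settings along these lines. The values shown here are illustrative 5.x settings rather than our exact thresholds:

    PUT tenant-index/_settings
    {
      "index.search.slowlog.threshold.query.warn": "10s",
      "index.search.slowlog.threshold.query.info": "5s",
      "index.search.slowlog.threshold.fetch.warn": "1s",
      "index.search.slowlog.threshold.fetch.info": "800ms"
    }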

Example

  • Below is an example query
  • The "accessControlLevels" terms filter is used to determine security requirements and includes multiple (27 in this case) entries that follow the below patterns
    "aduser:726877fb-ac7a-48fc-ac90-715a1a558f22"
    "adgroup:5eff07e9-388f-4219-bd6e-dc746b04fcd2"
    "email:someone@company.com"

[2018-10-04T20:41:54,177][WARN ][index.search.slowlog.fetch] [datanode-2] [fd23a7d3-362d-406b-82ba-7f7155734ab6_1002][2] took[12.7s], took_millis[12783], types[contentobject], stats[], search_type[QUERY_THEN_FETCH], total_shards[10], source[{"from":0,"size":25,"query":{"bool":{"must":[{"query_string":{"query":"abc def","fields":[],"use_dis_max":true,"tie_breaker":0.0,"default_operator":"and","auto_generate_phrase_queries":false,"max_determinized_states":10000,"enable_position_increments":true,"fuzziness":"AUTO","fuzzy_prefix_length":0,"fuzzy_max_expansions":50,"phrase_slop":0,"escape":false,"split_on_whitespace":true,"boost":1.0}}],"filter":[{"terms":{"accessControlLevels":[*see explanation*],"boost":1.0}}],"disable_coord":false,"adjust_pure_negative":true,"boost":1.0}},"_source":{"includes":["*"],"excludes":["content"]},"sort":[{"uniqueId":{"order":"asc"}}],"highlight":{"require_field_match":false,"fields":{"content":{"no_match_size":200}}}}],
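
For readability, here is the same request body from the slow-log entry pretty-printed; the accessControlLevels values are elided exactly as in the log above:

    {
      "from": 0,
      "size": 25,
      "query": {
        "bool": {
          "must": [
            {
              "query_string": {
                "query": "abc def",
                "fields": [],
                "use_dis_max": true,
                "tie_breaker": 0.0,
                "default_operator": "and",
                "auto_generate_phrase_queries": false,
                "max_determinized_states": 10000,
                "enable_position_increments": true,
                "fuzziness": "AUTO",
                "fuzzy_prefix_length": 0,
                "fuzzy_max_expansions": 50,
                "phrase_slop": 0,
                "escape": false,
                "split_on_whitespace": true,
                "boost": 1.0
              }
            }
          ],
          "filter": [
            {
              "terms": {
                "accessControlLevels": [*see explanation*],
                "boost": 1.0
              }
            }
          ],
          "disable_coord": false,
          "adjust_pure_negative": true,
          "boost": 1.0
        }
      },
      "_source": {
        "includes": ["*"],
        "excludes": ["content"]
      },
      "sort": [
        { "uniqueId": { "order": "asc" } }
      ],
      "highlight": {
        "require_field_match": false,
        "fields": {
          "content": { "no_match_size": 200 }
        }
      }
    }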

  • Example node GC log entries below

    [2018-10-04T20:17:11,242][INFO ][o.e.m.j.JvmGcMonitorService] [datanode-2] [gc][old][5696286][1085] duration [8.3s], collections [1]/[8.7s], total [8.3s]/[22.9m], memory [6.4gb]->[2.7gb]/[6.8gb], all_pools {[young] [361.2kb]->[7.1mb]/[266.2mb]}{[survivor] [33.2mb]->[0b]/[33.2mb]}{[old] [6.4gb]->[2.7gb]/[6.5gb]}
    [2018-10-04T20:17:11,242][WARN ][o.e.m.j.JvmGcMonitorService] [datanode-2] [gc][5696286] overhead, spent [8.3s] collecting in the last [8.7s]
    [2018-10-04T20:17:12,242][INFO ][o.e.m.j.JvmGcMonitorService] [datanode-2] [gc][5696287] overhead, spent [307ms] collecting in the last [1s]
    [2018-10-04T20:41:42,897][WARN ][o.e.m.j.JvmGcMonitorService] [datanode-2] [gc][5697757] overhead, spent [753ms] collecting in the last [1.3s]
    [2018-10-04T20:41:43,897][INFO ][o.e.m.j.JvmGcMonitorService] [datanode-2] [gc][5697758] overhead, spent [430ms] collecting in the last [1s]
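
The heap and old-generation GC figures behind these log lines can also be sampled numerically from the nodes stats API while a slow query is running; a sketch of the 5.x call:

    GET _nodes/stats/jvm?filter_path=nodes.*.name,nodes.*.jvm.mem.heap_used_percent,nodes.*.jvm.gc.collectors

The fields of interest are jvm.mem.heap_used_percent and jvm.gc.collectors.old.collection_count / collection_time_in_millis, sampled before and during the query.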

Questions I am hoping to answer:

  1. I am assuming from the data I have gathered that I need to increase the heap size to reduce the garbage collection. Is this correct?
  2. In order to improve query performance, are there certain caches I should increase, given that the same query may only be run a couple of times to locate a specific document and then never again, or not for hours/days? (The cache settings I have in mind are sketched after this list.)
  3. Given that it is likely the user issuing the query is looking for terms in the body of the documents, are there any settings I should consider to improve query performance from this perspective?
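
To make question 2 concrete, these are the cache-related settings I am aware of; a sketch of the 5.x options with illustrative values rather than anything I have tested (the first two are static node-level settings, the request cache is a dynamic index setting):

    # elasticsearch.yml (node-level, static)
    indices.queries.cache.size: 10%
    indices.fielddata.cache.size: 20%

    # per-index, dynamic - shard request cache
    PUT tenant-index/_settings
    {
      "index.requests.cache.enable": true
    }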

Many thanks in advance for any ideas/suggestions/comments.

Adam.


#2

Hi,

Is anyone able to provide some pointers on how to gain a deeper understanding of this issue? i.e. what metrics can I leverage (beyond the nodes stats call sketched in the first post) to understand the GC behaviour when I am issuing specific queries, and how this affects the FETCH phase performance.

If more information is needed to provide insight, please let me know.

Many thanks,

Adam.


(system) #3

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.