5.5.0 - JVM Heap and Garbage Collection issues when executing queries

Hi,

Hoping someone can point me in the right direction with an issue related to JVM heap usage and garbage collection when executing search queries. Here's what I know so far:

Use case

  • Performing queries against an index populated with documents (Word, Excel, PDF, email, etc.)
  • Elastic documents contain properties related to the document, plus a "content" field holding the extracted content of the document (so it can get quite large, depending on the document)
  • Queries are performed ad hoc by users (currently at a very low rate) and are free-text searches, i.e. find all documents containing certain words or phrases
  • Index rate for each index is rather low, i.e. ~5,000 documents an hour for an example index

Hardware/Elastic Setup

  • VMs running in Azure
  • 3 master, 3 client and 3 data nodes
  • Data nodes - 4 vCPU, 14 GB RAM
  • 7 GB allocated to the JVM heap, swapping disabled, memory locked (settings sketched below)
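
For reference, the heap and memory-lock configuration on the data nodes boils down to the standard settings below (a sketch of the relevant jvm.options and elasticsearch.yml entries; exact file locations depend on the install):

    # jvm.options - fixed 7 GB heap (Xms = Xmx so the heap is never resized)
    -Xms7g
    -Xmx7g

    # elasticsearch.yml - lock the heap in RAM so it cannot be swapped out
    bootstrap.memory_lock: true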

Index Configuration

  • Index per tenant (~30 so far) - 10 primary shards, 1 replica, 30s refresh interval
  • Size ranges from 1 GB to 400 GB (I'm currently testing against a 50 GB index, which is the largest on the specific cluster it is hosted on)
  • 2 document types with ~15 properties each; most are keyword, text, boolean, long or double, plus one nested property (a sketch of the settings/mapping is below)
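
To make the index layout concrete, here is a minimal sketch of the per-tenant settings and mapping described above (field names other than content, accessControlLevels and uniqueId are placeholders, not the real property names):

    PUT /tenant-index-example
    {
      "settings": {
        "number_of_shards": 10,
        "number_of_replicas": 1,
        "refresh_interval": "30s"
      },
      "mappings": {
        "contentobject": {
          "properties": {
            "uniqueId":            { "type": "keyword" },
            "accessControlLevels": { "type": "keyword" },
            "title":               { "type": "text" },
            "isConfidential":      { "type": "boolean" },
            "sizeInBytes":         { "type": "long" },
            "score":               { "type": "double" },
            "attachments":         { "type": "nested" },
            "content":             { "type": "text" }
          }
        }
      }
    }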

Issue

  • When performing certain queries (ones that return a limited number of results, i.e. the terms are infrequent in the target index), the query takes 15-60+ seconds
  • In these cases, the data nodes normally log that they are performing garbage collection, which is obviously affecting query performance
  • In the slow query log, the fetch phase for certain shards ranges from ~200 ms to 12+ seconds, which is reflected in the node GC logging (the slow log thresholds are sketched after this list)
  • We have had occasions where OOM exceptions were thrown
  • Most of the time, Kibana shows a massive drop in heap usage when the GC runs
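
For context, the slow query log entries below come from per-index slow log thresholds along these lines (a sketch of the standard settings, not my exact values):

    PUT /fd23a7d3-362d-406b-82ba-7f7155734ab6_1002/_settings
    {
      "index.search.slowlog.threshold.query.warn": "10s",
      "index.search.slowlog.threshold.query.info": "5s",
      "index.search.slowlog.threshold.fetch.warn": "1s",
      "index.search.slowlog.threshold.fetch.info": "500ms"
    }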

Example

  • Below is an example query, taken from the slow log (a pretty-printed version follows it)
  • The "accessControlLevels" terms filter is used to enforce security requirements and contains multiple entries (27 in this case) following the patterns below:
    "aduser:726877fb-ac7a-48fc-ac90-715a1a558f22"
    "adgroup:5eff07e9-388f-4219-bd6e-dc746b04fcd2"
    "email:someone@company.com"

    [2018-10-04T20:41:54,177][WARN ][index.search.slowlog.fetch] [datanode-2] [fd23a7d3-362d-406b-82ba-7f7155734ab6_1002][2] took[12.7s], took_millis[12783], types[contentobject], stats[], search_type[QUERY_THEN_FETCH], total_shards[10], source[{"from":0,"size":25,"query":{"bool":{"must":[{"query_string":{"query":"abc def","fields":[],"use_dis_max":true,"tie_breaker":0.0,"default_operator":"and","auto_generate_phrase_queries":false,"max_determinized_states":10000,"enable_position_increments":true,"fuzziness":"AUTO","fuzzy_prefix_length":0,"fuzzy_max_expansions":50,"phrase_slop":0,"escape":false,"split_on_whitespace":true,"boost":1.0}}],"filter":[{"terms":{"accessControlLevels":[*see explanation*],"boost":1.0}}],"disable_coord":false,"adjust_pure_negative":true,"boost":1.0}},"_source":{"includes":["*"],"excludes":["content"]},"sort":[{"uniqueId":{"order":"asc"}}],"highlight":{"require_field_match":false,"fields":{"content":{"no_match_size":200}}}}],
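
For readability, here is the same search body pretty-printed, with the query_string defaults dropped and the 27 accessControlLevels entries abbreviated as in the log:

    GET /fd23a7d3-362d-406b-82ba-7f7155734ab6_1002/contentobject/_search
    {
      "from": 0,
      "size": 25,
      "query": {
        "bool": {
          "must": [
            {
              "query_string": {
                "query": "abc def",
                "default_operator": "and",
                "fuzziness": "AUTO"
              }
            }
          ],
          "filter": [
            {
              "terms": {
                "accessControlLevels": [ "aduser:...", "adgroup:...", "email:..." ]
              }
            }
          ]
        }
      },
      "_source": { "includes": [ "*" ], "excludes": [ "content" ] },
      "sort": [ { "uniqueId": { "order": "asc" } } ],
      "highlight": {
        "require_field_match": false,
        "fields": { "content": { "no_match_size": 200 } }
      }
    }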

  • Example node GC log entries below (split onto separate lines for readability)

    [2018-10-04T20:17:11,242][INFO ][o.e.m.j.JvmGcMonitorService] [datanode-2] [gc][old][5696286][1085] duration [8.3s], collections [1]/[8.7s], total [8.3s]/[22.9m], memory [6.4gb]->[2.7gb]/[6.8gb], all_pools {[young] [361.2kb]->[7.1mb]/[266.2mb]}{[survivor] [33.2mb]->[0b]/[33.2mb]}{[old] [6.4gb]->[2.7gb]/[6.5gb]}
    [2018-10-04T20:17:11,242][WARN ][o.e.m.j.JvmGcMonitorService] [datanode-2] [gc][5696286] overhead, spent [8.3s] collecting in the last [8.7s]
    [2018-10-04T20:17:12,242][INFO ][o.e.m.j.JvmGcMonitorService] [datanode-2] [gc][5696287] overhead, spent [307ms] collecting in the last [1s]
    [2018-10-04T20:41:42,897][WARN ][o.e.m.j.JvmGcMonitorService] [datanode-2] [gc][5697757] overhead, spent [753ms] collecting in the last [1.3s]
    [2018-10-04T20:41:43,897][INFO ][o.e.m.j.JvmGcMonitorService] [datanode-2] [gc][5697758] overhead, spent [430ms] collecting in the last [1s]

Questions I am hoping to try and answer:

  1. From the data I have gathered, I am assuming that I need to increase the heap size to reduce the garbage collection. Is this correct?
  2. In order to improve query performance, are there certain caches I should increase, given that the same query may only be run a couple of times to locate a specific document and then never again, or not for hours/days? (The cache settings I have in mind are sketched after this list.)
  3. Given that the user issuing the query is most likely looking for terms in the body of the documents, are there any settings I should consider to improve query performance from this perspective?
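
For reference, the caches I am thinking of in question 2 are the standard ones below (a sketch with illustrative values, not settings I have already changed):

    # elasticsearch.yml (node level, static)
    indices.queries.cache.size: 10%
    indices.fielddata.cache.size: 20%

    # per index (dynamic) - shard request cache, mainly useful for repeated size=0/aggregation requests
    PUT /fd23a7d3-362d-406b-82ba-7f7155734ab6_1002/_settings
    { "index.requests.cache.enable": true }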

Many thanks in advance for any ideas/suggestions/comments.

Adam.

Hi,

Is anyone able to provide some pointers on how to gain a deeper understanding of this issue, i.e. which metrics I can use to understand the GC behaviour while I am issuing specific queries, and how this affects the fetch-phase performance?
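
For what it's worth, the per-node numbers I have been looking at so far come from the stock node stats and cat APIs, roughly like this (I believe the metric names are correct for 5.x, but please point me at anything better to watch):

    # JVM heap, GC counts/times and circuit breakers per node
    GET /_nodes/stats/jvm,breaker?human

    # query cache, request cache and fielddata memory per node
    GET /_nodes/stats/indices/query_cache,request_cache,fielddata?human

    # quick view of heap usage across the cluster
    GET /_cat/nodes?v&h=name,heap.percent,heap.current,heap.max,ram.percent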

If more information is needed to provide insight, please let me know.

Many thanks,

Adam.
