Finding why long GC pauses occur and fixing heap-efficiency issues in an Elasticsearch cluster

Hi all,

I'm trying to figure out why some of my Elasticsearch cluster's data nodes have frequent long GC pauses, which over the course of a month cause roughly 1 or 2 Green -> Red state changes per day due to "failed to ping" data node incidents.

JVM: OpenJDK Runtime Environment (build 1.8.0_101-b13)
Elasticsearch version: 5.5.2
Nbr of Data Nodes: 6
Nbr of Master Nodes: 2
Nbr of Coordinating Nodes: 2

Java Opts for Data Nodes:

-Xms24g -Xmx24g -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+DisableExplicitGC -XX:+AlwaysPreTouch -server -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -Djdk.io.permissionsUseCanonicalPath=true -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Dlog4j.skipJansi=true -XX:+HeapDumpOnOutOfMemoryError
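One thing that might help pin down what the collector is actually doing during these pauses is full GC logging on the data nodes. A minimal sketch for this JDK 8 / CMS setup, added to jvm.options (the log path is an assumption and would need adjusting):

# sketch: GC logging for JDK 8 / CMS; the log path below is an assumption
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintTenuringDistribution
-XX:+PrintGCApplicationStoppedTime
-Xloggc:/var/log/elasticsearch/gc.log
-XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=8
-XX:GCLogFileSize=64m

PrintGCApplicationStoppedTime in particular lets you separate real stop-the-world time from the overall collection overhead that the [gc] overhead lines report.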

Hardware Specs of Data Nodes: x86 VMs with 16 vCores, 48GB RAM and 1.25TB available for ES spread throughout 5 Physical Volumes

Documents: ~ 8.000.000.000
Shards: ~ 10.000 (roughly 5k primary and 5k replica)
Indices: ~ 1.000
Volume: ~ 4.5TB
Sustained Index Rate (P+R): 15.000/s - 20.000/s

Context: This is a cluster that serves documents from around 60 data sources and therefore has a lot of different indices and mappings. Unfortunately there are a lot of improvements to be made: a plague of small shards (7.500 shards are under 500MB), data nodes far beyond the rule-of-thumb shard count for a 24GB heap (~ 1.600 shards per node), and no index mapping optimization whatsoever.
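To quantify the small-shard problem per index and per node (and to track it before and after cleanup), something like the following works; this is a sketch and host/port are assumptions:

# smallest shards first
curl -s 'localhost:9200/_cat/shards?v&h=index,shard,prirep,store,node&s=store' | head -50

# count shards under 500MB (bytes=mb makes the store column plain megabytes)
curl -s 'localhost:9200/_cat/shards?h=index,shard,prirep,store&bytes=mb' | awk '$4 != "" && $4 < 500' | wc -l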

Here's an example of one data node's stats on a Sunday (virtually 0 searches):

On this specific node I have evidence of many minute level GC Pauses:

2018-12-19T04:02:57,920 [gc][6110963] overhead, spent [1.2m] collecting in the last [1.2m]
2018-12-20T08:06:57,883 [gc][6211791] overhead, spent [1m] collecting in the last [1m]
2018-12-20T17:18:28,841 [gc][6244736] overhead, spent [1.1m] collecting in the last [1.1m]
2018-12-21T16:14:47,651 [gc][6326964] overhead, spent [1.4m] collecting in the last [1.5m]
2018-12-21T20:21:20,460 [gc][6341484] overhead, spent [2.9m] collecting in the last [2.9m]
2018-12-21T22:41:03,973 [gc][6349648] overhead, spent [2.3m] collecting in the last [2.3m]
2018-12-22T00:40:14,904 [gc][6356343] overhead, spent [6.6m] collecting in the last [6.6m]
2018-12-22T02:44:56,330 [gc][6363693] overhead, spent [1.5m] collecting in the last [1.5m]
2018-12-22T04:01:00,736 [gc][6368150] overhead, spent [1.4m] collecting in the last [1.4m]
2018-12-22T04:51:03,506 [gc][6370787] overhead, spent [5.9m] collecting in the last [5.9m]
2018-12-22T08:41:52,865 [gc][6384154] overhead, spent [1.2m] collecting in the last [1.2m]
2018-12-22T19:15:19,185 [gc][6421845] overhead, spent [1m] collecting in the last [1m]
2018-12-24T10:09:05,670 [gc][6561098] overhead, spent [2m] collecting in the last [2m]
2018-12-26T00:16:33,479 [gc][6697879] overhead, spent [1.6m] collecting in the last [1.6m]
2018-12-26T02:24:00,858 [gc][6705353] overhead, spent [1m] collecting in the last [1m]
2018-12-26T03:14:50,082 [gc][6708315] overhead, spent [1.3m] collecting in the last [1.3m]
2018-12-27T05:08:30,824 [gc][6801268] overhead, spent [59.3s] collecting in the last [1m]
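For correlating these overhead lines with actual heap pressure, the same data can be pulled from the node stats API (a sketch; host/port are assumptions):

curl -s 'localhost:9200/_nodes/stats/jvm?filter_path=nodes.*.name,nodes.*.jvm.mem.heap_used_percent,nodes.*.jvm.gc.collectors&pretty'

Given -XX:CMSInitiatingOccupancyFraction=75, a heap_used_percent that sits persistently above ~75% is a sign the old gen is under sustained pressure rather than just spiking.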

Given that I'm not keen on scaling out such an inefficient cluster with more nodes (throwing money at it), and that adding more RAM to the VMs (e.g. 64GB, upping the ES heap to 30GB) might even produce longer GC pauses if nothing else is fixed, I'm betting my chips on reducing ES heap usage by:

  1. Reducing the number of shards (mostly by switching indices that currently produce small shards to coarser time patterns - e.g. from daily to monthly - but also lowering the shard count of some indices and shortening the retention time of others)
  2. Removing a bunch of metadata (from LS, Beats, etc.) from the document data model
  3. Removing replicas for non-critical and sparsely searched indices
  4. Stopping analysis of fields that are not used for full-text search, and disabling indexing entirely for miscellaneous fields that only need to live in the document and will never be used for full-text search, aggregations or sorting (sketched after this list)
  5. Figuring out why, even with bootstrap.memory_lock: true, I can still see a couple of hundred MB allocated to VmSwap (maybe I should just disable swapping or enforce a stricter swappiness)
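For items 3-5, a rough sketch of the kind of calls involved (index name, template name and field names below are hypothetical examples, not from the actual cluster):

# item 3: drop replicas on a non-critical, rarely searched index
curl -s -XPUT 'localhost:9200/logs-foo-2018.12/_settings' -H 'Content-Type: application/json' -d '
{ "index": { "number_of_replicas": 0 } }'

# item 4: in the index template, map non-search fields as keyword (no analysis)
# and disable indexing entirely for fields that only need to come back with the document
curl -s -XPUT 'localhost:9200/_template/logs-foo' -H 'Content-Type: application/json' -d '
{
  "template": "logs-foo-*",
  "mappings": {
    "doc": {
      "_all": { "enabled": false },
      "properties": {
        "status":  { "type": "keyword" },
        "payload": { "type": "text", "index": false }
      }
    }
  }
}'

# item 5: confirm memory locking actually took effect on every node
curl -s 'localhost:9200/_nodes?filter_path=nodes.*.name,nodes.*.process.mlockall&pretty'

If VmSwap still shows up with mlockall reporting true, lowering vm.swappiness (or disabling swap on the data nodes altogether) is the usual complementary step.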

Has anyone been in a similar situation, and do you have additional ideas as to what can be done to improve the cluster's heap efficiency?

I'm also trying to figure out a couple of "optimal values" so I can decide on some KPIs to evaluate the results of fixing all of these problems (a way to measure them is sketched after the list):

  1. For a 24-30 GB ES Heap, what is considered a healthy upper bound for GC Pause times?
  2. For a 24-30 GB ES Heap, what is considered a healthy interval between Old Gen GCs?
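As for measuring those two KPIs, the old-gen collector counters in the node stats are cumulative since node start, so sampling them periodically gives both the average pause per old-gen GC and the interval between collections (a sketch; host/port and the 5-minute sampling interval are assumptions):

while true; do
  date -u +%FT%TZ
  curl -s 'localhost:9200/_nodes/stats/jvm?filter_path=nodes.*.name,nodes.*.jvm.gc.collectors.old'
  echo
  sleep 300
done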

Did you try reducing the heap size instead of increasing it? It seems like you have quite a lot of heap, not much of it is utilized, and under heavy load it might accumulate quite a bit of garbage.

Indeed, some old gen GCs are able to release almost half of the used JVM heap (accumulation?). What heap size would you fall back to in this case? 16GB?

When I look at the heap graph that you posted I see a pretty healthy sawtooth pattern going between 10G and 17G. So I feel like 16G would be a pretty safe first iteration, after which I would suggest checking how the system is behaving and possibly trying again. From the graph you posted I see that Lucene is using less than 2.5G and I don't see any other heap users, so it's hard to tell how low you can go.
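A minimal sketch of that first iteration, assuming the bundled jvm.options is used on the data nodes:

-Xms16g
-Xmx16g

Keeping Xms and Xmx equal avoids heap resizing, and rolling the change through one data node at a time keeps the cluster available while you compare GC behaviour before and after.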
