Hi all,
I'm trying to figure out why some of my Elasticsearch cluster's data nodes have frequent long GC pauses, which cause around 1 or 2 Green -> Red state changes per day over the course of a month, due to "failed to ping" data node incidents.
JVM: OpenJDK Runtime Environment (build 1.8.0_101-b13)
Elasticsearch version: 5.5.2
Nbr of Data Nodes: 6
Nbr of Master Nodes: 2
Nbr of Coordinating Nodes: 2
Java Opts for Data Nodes:
-Xms24g -Xmx24g -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+DisableExplicitGC -XX:+AlwaysPreTouch -server -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -Djdk.io.permissionsUseCanonicalPath=true -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Dlog4j.skipJansi=true -XX:+HeapDumpOnOutOfMemoryError
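GC logging flags are not part of the list above, so the only GC-level visibility below comes from Elasticsearch's own [gc] overhead log. For completeness, the standard HotSpot 8 flags that would add a proper GC log (the log path here is just a placeholder) would be:
-XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -Xloggc:/var/log/elasticsearch/gc.log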
Hardware Specs of Data Nodes: x86 VMs with 16 vCores, 48GB RAM and 1.25TB available for ES, spread across 5 Physical Volumes
Documents: ~ 8.000.000.000
Shards: ~ 10.000 (roughly 5k primary and 5k replica)
Indices: ~ 1.000
Volume: ~ 4.5TB
Sustained Index Rate (P+R): 15.000/s - 20.000/s
Context: This is a cluster that serves documents from around 60 data sources and therefore has a lot of different indices and mappings. Unfortunately there is a lot of room for improvement: a plague of small shards (7.500 shards are under 500MB), data nodes well beyond the rule of thumb for shard count given 24GB of heap (~ 1.600 shards/node), and no index mapping optimization whatsoever.
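In case it helps to see where the small-shard number comes from, this is roughly how it can be recomputed from _cat/shards (just a sketch: it assumes Python 3 with requests and an unauthenticated cluster on localhost:9200):

```python
import requests

ES = "http://localhost:9200"  # assumption: local, unauthenticated cluster
SMALL_SHARD_BYTES = 500 * 1024 * 1024  # the 500MB threshold mentioned above

# One JSON object per shard, with the store size reported in bytes.
shards = requests.get(
    ES + "/_cat/shards",
    params={"format": "json", "bytes": "b", "h": "index,prirep,store"},
).json()

# Unassigned shards report no store size, so skip them.
small = [s for s in shards
         if s["store"] is not None and int(s["store"]) < SMALL_SHARD_BYTES]
print("%d of %d shards are under 500MB" % (len(small), len(shards)))
```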
Here's an example of one data node's stats on a Sunday (virtually 0 searches):
On this specific node I have evidence of many minute-level GC Pauses:
2018-12-19T04:02:57,920 [gc][6110963] overhead, spent [1.2m] collecting in the last [1.2m]
2018-12-20T08:06:57,883 [gc][6211791] overhead, spent [1m] collecting in the last [1m]
2018-12-20T17:18:28,841 [gc][6244736] overhead, spent [1.1m] collecting in the last [1.1m]
2018-12-21T16:14:47,651 [gc][6326964] overhead, spent [1.4m] collecting in the last [1.5m]
2018-12-21T20:21:20,460 [gc][6341484] overhead, spent [2.9m] collecting in the last [2.9m]
2018-12-21T22:41:03,973 [gc][6349648] overhead, spent [2.3m] collecting in the last [2.3m]
2018-12-22T00:40:14,904 [gc][6356343] overhead, spent [6.6m] collecting in the last [6.6m]
2018-12-22T02:44:56,330 [gc][6363693] overhead, spent [1.5m] collecting in the last [1.5m]
2018-12-22T04:01:00,736 [gc][6368150] overhead, spent [1.4m] collecting in the last [1.4m]
2018-12-22T04:51:03,506 [gc][6370787] overhead, spent [5.9m] collecting in the last [5.9m]
2018-12-22T08:41:52,865 [gc][6384154] overhead, spent [1.2m] collecting in the last [1.2m]
2018-12-22T19:15:19,185 [gc][6421845] overhead, spent [1m] collecting in the last [1m]
2018-12-24T10:09:05,670 [gc][6561098] overhead, spent [2m] collecting in the last [2m]
2018-12-26T00:16:33,479 [gc][6697879] overhead, spent [1.6m] collecting in the last [1.6m]
2018-12-26T02:24:00,858 [gc][6705353] overhead, spent [1m] collecting in the last [1m]
2018-12-26T03:14:50,082 [gc][6708315] overhead, spent [1.3m] collecting in the last [1.3m]
2018-12-27T05:08:30,824 [gc][6801268] overhead, spent [59.3s] collecting in the last [1m]
Assuming that I'm not keen on scaling out such an inefficient cluster with more nodes (throwing money at it), and that, per common sense, adding more RAM to the VMs (e.g. 64GB, upping the ES heap to 30GB) might even produce longer GC pauses if nothing else changes, I'm betting my chips on reducing ES heap usage by:
- Reducing the number of shards: mostly by moving to coarser time patterns on the indices that currently produce small shards (e.g. from daily to monthly indices), but also lowering the shard count of some indices and shortening retention for others (see the template sketch after this list)
- Removing a bunch of metadata (from LS, Beats, etc.) from the document data model
- Removing replicas for non-critical and sparsely searched indices
- Stopping the analysis of fields that are not used for full-text search, and disabling indexing altogether for misc fields that only need to exist in the document and will never be used for full-text search, aggregations or sorting
- Figuring out why, even with bootstrap.memory_lock: true, I can still see a couple of hundred MB allocated to VmSwap (maybe I should just disable swapping or enforce a stricter swappiness)
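To make the shard/template/mapping part of the plan more concrete, here is the kind of change I mean (a sketch only: index names, field names and shard counts are made up for illustration, it uses the 5.x template API, and it assumes Python 3 with requests against an unauthenticated cluster on localhost:9200):

```python
import requests

ES = "http://localhost:9200"  # assumption: local, unauthenticated cluster

# Hypothetical template for one low-volume source: monthly indices, a single
# primary shard, no replica, and a mapping that only analyzes the fields
# actually used for full-text search.
template = {
    "template": "source42-*",          # e.g. source42-2018.12 (monthly pattern)
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0,       # non-critical, sparsely searched source
    },
    "mappings": {
        "doc": {
            "properties": {
                "message":  {"type": "text"},                     # full-text search
                "status":   {"type": "keyword"},                  # aggregations only
                "raw_blob": {"type": "keyword", "index": False,   # kept in _source,
                             "doc_values": False},                # never searched/aggregated
                "beat":     {"type": "object", "enabled": False}  # metadata we don't query
            }
        }
    },
}

requests.put(ES + "/_template/source42", json=template).raise_for_status()

# Dropping the replica on an existing non-critical index (name is hypothetical):
requests.put(
    ES + "/someindex-2018.12/_settings",
    json={"index": {"number_of_replicas": 0}},
).raise_for_status()
```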
Has anyone been in a similar situation, and does anyone have additional ideas on what can be done to improve the cluster's heap efficiency?
I'm also trying to figure out a couple of "optimal values" so I can define some KPIs to evaluate the results of fixing all of these problems:
- For a 24-30 GB ES Heap, what is considered a healthy upper bound for GC Pause times?
- For a 24-30 GB ES Heap, what is considered a healthy interval between Old Gen GCs?
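In case it's useful for answering those two questions, the kind of measurement I have in mind for the KPIs is based on the old-gen collector counters in _nodes/stats/jvm, something like this (a sketch, again assuming Python 3 with requests and an unauthenticated cluster on localhost:9200; the one-hour window is arbitrary):

```python
import time
import requests

ES = "http://localhost:9200"  # assumption: local, unauthenticated cluster

def old_gen_stats():
    """Old-gen (CMS) collection count and time per node, from _nodes/stats/jvm."""
    nodes = requests.get(ES + "/_nodes/stats/jvm").json()["nodes"]
    return {n["name"]: n["jvm"]["gc"]["collectors"]["old"] for n in nodes.values()}

before = old_gen_stats()
time.sleep(3600)  # arbitrary one-hour sampling window
after = old_gen_stats()

for node, now in sorted(after.items()):
    prev = before.get(node)
    if prev is None:
        continue  # node joined the cluster during the window
    count = now["collection_count"] - prev["collection_count"]
    millis = now["collection_time_in_millis"] - prev["collection_time_in_millis"]
    avg_s = millis / count / 1000.0 if count else 0.0
    print("%s: %d old-gen collections, avg %.1fs each" % (node, count, avg_s))
```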