Hi,
We use Elasticsearch for Logstash data only.
We have a cluster of 8 nodes. Every day from midnight (when Logstash creates a new index, using the default logstash-yyyy-mm-dd pattern), the whole cluster goes crazy and starts using a lot of native memory.
In the past it used so much memory that the processes were killed by the kernel (OOM killer).
We have mitigated this by adding -XX:MaxDirectMemorySize=3g and removing the -XX:+DisableExplicitGC option. The processes rarely get killed anymore, but the nodes now spend so much time in GC that they cannot do anything else. Stopping the log shipping stops the GC. Sometimes, if we let a node settle for 30 minutes and then send logs again, everything is fine; sometimes it is not and it takes forever to recover.
We have since split the cluster into 3 dedicated master nodes and 8 data nodes. The cluster state is now more stable since the masters are no longer impacted, but the data nodes are still struggling to do anything.
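In case it is useful, this is roughly how we can watch the off-heap usage per node through the node stats API (the host and credentials are placeholders for our SearchGuard-protected endpoint, not the real ones):

# per-node heap usage and direct/mapped buffer pools (ES 5.x node stats API)
curl -sk -u user:password \
  'https://10.x.x.x:9200/_nodes/stats/jvm?filter_path=nodes.*.name,nodes.*.jvm.mem.heap_used_in_bytes,nodes.*.jvm.buffer_pools&pretty'

The direct pool is what -XX:MaxDirectMemorySize=3g caps; the mapped pool is mmapped index data and is not limited by that flag.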
Stack:
ES 5.5
SearchGuard
OS : Amazon Linux
Java : 1.8.0_131
30 GB or 64 GB of RAM per server (AWS EC2 i3.xlarge or i3.2xlarge)
About 250 GB of logs per day
10 indices (see the index listing sketch after this list)
90 fields
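For reference, the daily indices and their sizes can be listed with the cat API (same placeholders as above):

# list the logstash indices with document count and on-disk size
curl -sk -u user:password 'https://10.x.x.x:9200/_cat/indices/logstash-*?v&h=index,docs.count,store.size'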
ES log sample:
[2018-02-13T05:57:22,638][INFO ][o.e.m.j.JvmGcMonitorService] [gol-prodmgt-elasticsearch-xxxx] [gc][278217] overhead, spent [710ms] collecting in the last [1.4s]
[2018-02-13T05:57:24,312][INFO ][o.e.m.j.JvmGcMonitorService] [gol-prodmgt-elasticsearch-xxxx] [gc][278218] overhead, spent [736ms] collecting in the last [1.6s]
[2018-02-13T05:57:25,980][INFO ][o.e.m.j.JvmGcMonitorService] [gol-prodmgt-elasticsearch-xxxx] [gc][278219] overhead, spent [704ms] collecting in the last [1.6s]
[2018-02-13T05:57:27,662][INFO ][o.e.m.j.JvmGcMonitorService] [gol-prodmgt-elasticsearch-xxxx] [gc][278220] overhead, spent [696ms] collecting in the last [1.6s]
[2018-02-13T05:57:29,310][INFO ][o.e.m.j.JvmGcMonitorService] [gol-prodmgt-elasticsearch-xxxx] [gc][278221] overhead, spent [671ms] collecting in the last [1.6s]
[2018-02-13T05:57:30,815][INFO ][o.e.m.j.JvmGcMonitorService] [gol-prodmgt-elasticsearch-xxxx] [gc][278222] overhead, spent [673ms] collecting in the last [1.5s]
[2018-02-13T05:57:32,277][INFO ][o.e.m.j.JvmGcMonitorService] [gol-prodmgt-elasticsearch-xxxx] [gc][278223] overhead, spent [665ms] collecting in the last [1.4s]
[2018-02-13T05:57:33,735][INFO ][o.e.m.j.JvmGcMonitorService] [gol-prodmgt-elasticsearch-xxxx] [gc][278224] overhead, spent [675ms] collecting in the last [1.4s]
[2018-02-13T05:57:35,179][INFO ][o.e.m.j.JvmGcMonitorService] [gol-prodmgt-elasticsearch-xxxx] [gc][278225] overhead, spent [675ms] collecting in the last [1.4s]
[2018-02-13T05:57:36,652][INFO ][o.e.m.j.JvmGcMonitorService] [gol-prodmgt-elasticsearch-xxxx] [gc][278226] overhead, spent [678ms] collecting in the last [1.4s]
[2018-02-13T05:57:38,156][INFO ][o.e.m.j.JvmGcMonitorService] [gol-prodmgt-elasticsearch-xxxx] [gc][278227] overhead, spent [674ms] collecting in the last [1.5s]
Thread dump when it's working:
https://pastebin.com/DnRjUr9e
Thread dump when it's not working:
https://pastebin.com/W1PvaAbU
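In case someone wants to compare with their own cluster, dumps like these can be captured with jstack against the Elasticsearch JVM, for example (the pgrep pattern is just one way to find the pid; run it as the same user as the ES process):

sudo -u elasticsearch jstack $(pgrep -f org.elasticsearch.bootstrap.Elasticsearch) > /tmp/es-threads-$(date +%s).txt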
JVM configuration (on a 30 GB node):
-Xmx7646m
-Xms7646m
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly
-XX:+AlwaysPreTouch
-server
-Djava.awt.headless=true
-Dfile.encoding=UTF-8
-Djna.nosys=true
-Dio.netty.noUnsafe=true
-Dio.netty.noKeySetOptimization=true
-Dlog4j.shutdownHookEnabled=false
-Dlog4j2.disable.jmx=true
-Dlog4j.skipJansi=true
-XX:+HeapDumpOnOutOfMemoryError
-Xloggc:/tmp/gc.log
-XX:+PrintGC
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintGCTimeStamps
-XX:+PrintGCTaskTimeStamps
-verbose:gc
-Djdk.nio.maxCachedBufferSize=262144
-XX:MaxDirectMemorySize=3g
(JMX)
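With these flags, the GC thrashing itself can be watched from the OS with the standard JDK tools, something along these lines (same placeholder pid lookup as above):

# old-gen occupancy (O), full GC count (FGC) and GC times, sampled every second
sudo -u elasticsearch jstat -gcutil $(pgrep -f org.elasticsearch.bootstrap.Elasticsearch) 1000
# confirm the effective flags, e.g. that MaxDirectMemorySize=3g really took effect
sudo -u elasticsearch jcmd $(pgrep -f org.elasticsearch.bootstrap.Elasticsearch) VM.flags

If the O column stays pinned near 100% while FGC keeps climbing, the heap itself is exhausted rather than the direct memory.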
ES configuration:
cluster.name: production
node.name: gol-prodmgt-elasticsearch-i-xxx
path.conf: "/etc/elasticsearch"
path.data: "/var/lib/elasticsearch"
path.logs: "/var/log/elasticsearch"
node.attr.az: eu-west-1c
node.master: false
cluster.routing.allocation.awareness.attributes: az
cloud.aws.region: eu-west-1
plugin.mandatory: discovery-ec2, repository-s3
discovery.type: ec2
discovery.ec2.groups: gol-prodmgt-elasticsearch-instance-sg
discovery.zen.minimum_master_nodes: 2
network.bind_host: 0.0.0.0
network.host: 10.x.x.x
http.cors.enabled: true
indices.memory.index_buffer_size: 30%
http.cors.allow-origin: "***"
bootstrap.system_call_filter: true
bootstrap.memory_lock: true
searchguard.ssl...
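When the data nodes are stuck in GC, the hot threads API is another thing worth capturing alongside the thread dumps; a sketch with the same placeholders as above:

# top 5 hottest threads per node, sampled over 1 second
curl -sk -u user:password 'https://10.x.x.x:9200/_nodes/hot_threads?threads=5&interval=1s'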
We have run out of ideas on how to fix this. The thread dump taken while we have issues shows a snapshot running, but that snapshot is only triggered 2 hours after the issue starts, and we can take snapshots without any problem when everything works well.
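To rule the snapshot out, this is how one could check whether a snapshot is actually running at the moment the GC storm starts (without a repository name, _status only returns the snapshots currently running):

curl -sk -u user:password 'https://10.x.x.x:9200/_snapshot/_status?pretty'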
Thanks a lot for your help!