GC overhead each day at new index creation

Hi,

We use Elasticsearch for Logstash data only.
We have a cluster of 8 nodes. Every day from midnight, when Logstash creates a new index (default logstash-yyyy-mm-dd pattern), the whole cluster goes crazy and starts using a lot of native memory.
In the past the processes used so much memory that they were killed by the kernel.
We've since mitigated that by adding -XX:MaxDirectMemorySize=3g and removing the DisableExplicitGC option. The processes don't get killed anymore (or only rarely), but they spend so much time in GC that they can't do anything else. If we stop sending logs, the GC storm stops. Sometimes, if we let the cluster settle for 30 minutes and then send logs again, everything is fine; sometimes it isn't, and it takes forever to recover.
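In case it's useful, direct buffer usage can be read from the nodes stats API; something like this (the node name is just a placeholder, and SearchGuard SSL/auth options are omitted for brevity):

# direct and mapped buffer pool usage for one data node
curl -s 'http://localhost:9200/_nodes/gol-prodmgt-elasticsearch-xxxx/stats/jvm?pretty' | grep -A 4 '"direct"'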

We have split the cluster into 3 dedicated master nodes and 8 data nodes. The cluster state is now more stable since the masters are no longer impacted, but the data nodes are still struggling to do anything.
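For clarity, the split is just the standard role settings in elasticsearch.yml, roughly (one block per node type):

# dedicated master nodes
node.master: true
node.data: false
node.ingest: false

# data-only nodes (matching the node.master: false in the config further down)
node.master: false
node.data: true
node.ingest: true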

Stack:
ES 5.5
SearchGuard
OS: Amazon Linux
Java: 1.8.0_131
30G or 64G of RAM per server (AWS EC2 i3.xlarge or i3.2xlarge)
About 250G of logs per day
10 indices
90 fields

ES log sample:

[2018-02-13T05:57:22,638][INFO ][o.e.m.j.JvmGcMonitorService] [gol-prodmgt-elasticsearch-xxxx] [gc][278217] overhead, spent [710ms] collecting in the last [1.4s]
[2018-02-13T05:57:24,312][INFO ][o.e.m.j.JvmGcMonitorService] [gol-prodmgt-elasticsearch-xxxx] [gc][278218] overhead, spent [736ms] collecting in the last [1.6s]
[2018-02-13T05:57:25,980][INFO ][o.e.m.j.JvmGcMonitorService] [gol-prodmgt-elasticsearch-xxxx] [gc][278219] overhead, spent [704ms] collecting in the last [1.6s]
[2018-02-13T05:57:27,662][INFO ][o.e.m.j.JvmGcMonitorService] [gol-prodmgt-elasticsearch-xxxx] [gc][278220] overhead, spent [696ms] collecting in the last [1.6s]
[2018-02-13T05:57:29,310][INFO ][o.e.m.j.JvmGcMonitorService] [gol-prodmgt-elasticsearch-xxxx] [gc][278221] overhead, spent [671ms] collecting in the last [1.6s]
[2018-02-13T05:57:30,815][INFO ][o.e.m.j.JvmGcMonitorService] [gol-prodmgt-elasticsearch-xxxx] [gc][278222] overhead, spent [673ms] collecting in the last [1.5s]
[2018-02-13T05:57:32,277][INFO ][o.e.m.j.JvmGcMonitorService] [gol-prodmgt-elasticsearch-xxxx] [gc][278223] overhead, spent [665ms] collecting in the last [1.4s]
[2018-02-13T05:57:33,735][INFO ][o.e.m.j.JvmGcMonitorService] [gol-prodmgt-elasticsearch-xxxx] [gc][278224] overhead, spent [675ms] collecting in the last [1.4s]
[2018-02-13T05:57:35,179][INFO ][o.e.m.j.JvmGcMonitorService] [gol-prodmgt-elasticsearch-xxxx] [gc][278225] overhead, spent [675ms] collecting in the last [1.4s]
[2018-02-13T05:57:36,652][INFO ][o.e.m.j.JvmGcMonitorService] [gol-prodmgt-elasticsearch-xxxx] [gc][278226] overhead, spent [678ms] collecting in the last [1.4s]
[2018-02-13T05:57:38,156][INFO ][o.e.m.j.JvmGcMonitorService] [gol-prodmgt-elasticsearch-xxxx] [gc][278227] overhead, spent [674ms] collecting in the last [1.5s]

Thread dump when it's working:
https://pastebin.com/DnRjUr9e

Thread dump when it's not working:
https://pastebin.com/W1PvaAbU

JVM configuration (on a 30G node):
-Xmx7646m
-Xms7646m
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly
-XX:+AlwaysPreTouch
-server
-Djava.awt.headless=true
-Dfile.encoding=UTF-8
-Djna.nosys=true
-Dio.netty.noUnsafe=true
-Dio.netty.noKeySetOptimization=true
-Dlog4j.shutdownHookEnabled=false
-Dlog4j2.disable.jmx=true
-Dlog4j.skipJansi=true
-XX:+HeapDumpOnOutOfMemoryError
-Xloggc:/tmp/gc.log
-XX:+PrintGC
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintGCTimeStamps
-XX:+PrintGCTaskTimeStamps
-verbose:gc
-Djdk.nio.maxCachedBufferSize=262144
-XX:MaxDirectMemorySize=3g
(JMX)

ES configuration:
cluster.name: production
node.name: gol-prodmgt-elasticsearch-i-xxx
path.conf: "/etc/elasticsearch"
path.data: "/var/lib/elasticsearch"
path.logs: "/var/log/elasticsearch"
node.attr.az: eu-west-1c
node.master: false
cluster.routing.allocation.awareness.attributes: az
cloud.aws.region: eu-west-1
plugin.mandatory: discovery-ec2, repository-s3
discovery.type: ec2
discovery.ec2.groups: gol-prodmgt-elasticsearch-instance-sg
discovery.zen.minimum_master_nodes: 2
network.bind_host: 0.0.0.0
network.host: 10.x.x.x
http.cors.enabled: true
indices.memory.index_buffer_size: 30%
http.cors.allow-origin: "***"
bootstrap.system_call_filter: true
bootstrap.memory_lock: true
searchguard.ssl...

We have run out of ideas on how to fix this. The thread dump taken while we have issues shows a snapshot running, but that snapshot is only triggered 2 hours after the issue starts, and we can run snapshots without any problem when everything works well.

Thanks a lot for your help!

Update: after yesterday's restart, today we were able to recover by simply stopping and starting Logstash.

We definitely want to avoid proactive restarts. Would you have any ideas on how to find the root cause of the issue?
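Beyond full thread dumps, maybe a hot threads capture taken during an episode would help narrow it down; something like this (again omitting SearchGuard SSL/auth):

# busiest threads across the cluster while the GC overhead messages are being logged
curl -s 'http://localhost:9200/_nodes/hot_threads?threads=5'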

Thank you!
