Hi,
We use Elasticsearch for Logstash data only.
We have a cluster of 8 nodes. Every day from midnight (when Logstash creates a new index, using the default logstash-yyyy-mm-dd pattern), the whole cluster goes crazy and starts using a lot of native memory.
In the past it used so much memory that the processes were killed by the kernel (OOM killer).
We have mitigated this by adding -XX:MaxDirectMemorySize=3g and removing the -XX:+DisableExplicitGC option. The processes rarely get killed anymore, but the nodes now spend so much time in GC that they cannot do anything else. Stopping the log shipping stops the GC. Sometimes, if we let a node settle for 30 minutes and then send logs again, everything is fine; sometimes it is not and it takes forever to recover.
We have since split the cluster into 3 dedicated master nodes and 8 data nodes. The cluster state is now more stable since the masters are no longer impacted, but the data nodes are still struggling to do anything.
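In case it is useful, this is roughly how we can watch the off-heap usage per node through the node stats API (the host and credentials are placeholders for our SearchGuard-protected endpoint, not the real ones):

# per-node heap usage and direct/mapped buffer pools (ES 5.x node stats API)
curl -sk -u user:password \
  'https://10.x.x.x:9200/_nodes/stats/jvm?filter_path=nodes.*.name,nodes.*.jvm.mem.heap_used_in_bytes,nodes.*.jvm.buffer_pools&pretty'

The direct pool is what -XX:MaxDirectMemorySize=3g caps; the mapped pool is mmapped index data and is not limited by that flag.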
Stack:
ES 5.5
SearchGuard
OS : Amazon Linux
Java : 1.8.0_131
30 GB or 64 GB of RAM per server (AWS EC2 i3.xlarge or i3.2xlarge)
About 250 GB of logs per day
10 indices (see the index listing sketch after this list)
90 fields
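For reference, the daily indices and their sizes can be listed with the cat API (same placeholders as above):

# list the logstash indices with document count and on-disk size
curl -sk -u user:password 'https://10.x.x.x:9200/_cat/indices/logstash-*?v&h=index,docs.count,store.size'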
ES log sample:
[2018-02-13T05:57:22,638][INFO ][o.e.m.j.JvmGcMonitorService] [gol-prodmgt-elasticsearch-xxxx] [gc][278217] overhead, spent [710ms] collecting in the last [1.4s]
[2018-02-13T05:57:24,312][INFO ][o.e.m.j.JvmGcMonitorService] [gol-prodmgt-elasticsearch-xxxx] [gc][278218] overhead, spent [736ms] collecting in the last [1.6s]
[2018-02-13T05:57:25,980][INFO ][o.e.m.j.JvmGcMonitorService] [gol-prodmgt-elasticsearch-xxxx] [gc][278219] overhead, spent [704ms] collecting in the last [1.6s]
[2018-02-13T05:57:27,662][INFO ][o.e.m.j.JvmGcMonitorService] [gol-prodmgt-elasticsearch-xxxx] [gc][278220] overhead, spent [696ms] collecting in the last [1.6s]
[2018-02-13T05:57:29,310][INFO ][o.e.m.j.JvmGcMonitorService] [gol-prodmgt-elasticsearch-xxxx] [gc][278221] overhead, spent [671ms] collecting in the last [1.6s]
[2018-02-13T05:57:30,815][INFO ][o.e.m.j.JvmGcMonitorService] [gol-prodmgt-elasticsearch-xxxx] [gc][278222] overhead, spent [673ms] collecting in the last [1.5s]
[2018-02-13T05:57:32,277][INFO ][o.e.m.j.JvmGcMonitorService] [gol-prodmgt-elasticsearch-xxxx] [gc][278223] overhead, spent [665ms] collecting in the last [1.4s]
[2018-02-13T05:57:33,735][INFO ][o.e.m.j.JvmGcMonitorService] [gol-prodmgt-elasticsearch-xxxx] [gc][278224] overhead, spent [675ms] collecting in the last [1.4s]
[2018-02-13T05:57:35,179][INFO ][o.e.m.j.JvmGcMonitorService] [gol-prodmgt-elasticsearch-xxxx] [gc][278225] overhead, spent [675ms] collecting in the last [1.4s]
[2018-02-13T05:57:36,652][INFO ][o.e.m.j.JvmGcMonitorService] [gol-prodmgt-elasticsearch-xxxx] [gc][278226] overhead, spent [678ms] collecting in the last [1.4s]
[2018-02-13T05:57:38,156][INFO ][o.e.m.j.JvmGcMonitorService] [gol-prodmgt-elasticsearch-xxxx] [gc][278227] overhead, spent [674ms] collecting in the last [1.5s]
Thread dump when it's working:
https://pastebin.com/DnRjUr9e
Thread dump when it's not working:
https://pastebin.com/W1PvaAbU
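In case someone wants to compare with their own cluster, dumps like these can be captured with jstack against the Elasticsearch JVM, for example (the pgrep pattern is just one way to find the pid; run it as the same user as the ES process):

sudo -u elasticsearch jstack $(pgrep -f org.elasticsearch.bootstrap.Elasticsearch) > /tmp/es-threads-$(date +%s).txt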
JVM configuration (on a 30 GB node):
-Xmx7646m
-Xms7646m
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly
-XX:+AlwaysPreTouch
-server
-Djava.awt.headless=true
-Dfile.encoding=UTF-8
-Djna.nosys=true
-Dio.netty.noUnsafe=true
-Dio.netty.noKeySetOptimization=true
-Dlog4j.shutdownHookEnabled=false
-Dlog4j2.disable.jmx=true
-Dlog4j.skipJansi=true
-XX:+HeapDumpOnOutOfMemoryError
-Xloggc:/tmp/gc.log
-XX:+PrintGC
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintGCTimeStamps
-XX:+PrintGCTaskTimeStamps
-verbose:gc
-Djdk.nio.maxCachedBufferSize=262144
-XX:MaxDirectMemorySize=3g
(JMX)
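With these flags, the GC thrashing itself can be watched from the OS with the standard JDK tools, something along these lines (same placeholder pid lookup as above):

# old-gen occupancy (O), full GC count (FGC) and GC times, sampled every second
sudo -u elasticsearch jstat -gcutil $(pgrep -f org.elasticsearch.bootstrap.Elasticsearch) 1000
# confirm the effective flags, e.g. that MaxDirectMemorySize=3g really took effect
sudo -u elasticsearch jcmd $(pgrep -f org.elasticsearch.bootstrap.Elasticsearch) VM.flags

If the O column stays pinned near 100% while FGC keeps climbing, the heap itself is exhausted rather than the direct memory.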
ES configuration:
cluster.name: production
node.name: gol-prodmgt-elasticsearch-i-xxx
path.conf: "/etc/elasticsearch"
path.data: "/var/lib/elasticsearch"
path.logs: "/var/log/elasticsearch"
node.attr.az: eu-west-1c
node.master: false
cluster.routing.allocation.awareness.attributes: az
cloud.aws.region: eu-west-1
plugin.mandatory: discovery-ec2, repository-s3
discovery.type: ec2
discovery.ec2.groups: gol-prodmgt-elasticsearch-instance-sg
discovery.zen.minimum_master_nodes: 2
network.bind_host: 0.0.0.0
network.host: 10.x.x.x
http.cors.enabled: true
indices.memory.index_buffer_size: 30%
http.cors.allow-origin: "***"
bootstrap.system_call_filter: true
bootstrap.memory_lock: true
searchguard.ssl...
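When the data nodes are stuck in GC, the hot threads API is another thing worth capturing alongside the thread dumps; a sketch with the same placeholders as above:

# top 5 hottest threads per node, sampled over 1 second
curl -sk -u user:password 'https://10.x.x.x:9200/_nodes/hot_threads?threads=5&interval=1s'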
We have run out of ideas on how to fix this. The thread dump taken while we have issues shows a snapshot running, but that snapshot is only triggered 2 hours after the issue starts, and we can take snapshots without any problem when everything works well.
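To rule the snapshot out, this is how one could check whether a snapshot is actually running at the moment the GC storm starts (without a repository name, _status only returns the snapshots currently running):

curl -sk -u user:password 'https://10.x.x.x:9200/_snapshot/_status?pretty'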
Thanks a lot for your help!