ES 2.3.x: jvm.monitor log shows GC taking > 1min, cluster breaks

I'm running a very basic 2-node ES+Kibana cluster on AWS EC2 in a dev environment, with HEAP_SIZE set to 2 GB (on an m3.medium instance, which has 3.75 GB RAM in total). I'm seeing a node start crawling and spewing log messages like:
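For completeness, this is how the heap is set (Debian-style packaging; the exact file path depends on how ES was installed, so treat it as an assumption):

```shell
# /etc/default/elasticsearch (Debian/Ubuntu package layout; path may differ)
# 2g heap on a 3.75 GB m3.medium leaves ~1.75 GB for the OS and page cache
ES_HEAP_SIZE=2g
```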

[2016-05-18 02:05:26,071][WARN ][monitor.jvm ] [i-0953799xxxxxxxxx] [gc][young][103][20] duration [1.5m], collections [1]/[1.5m], total [1.5m]/[14.4m], memory [1gb]->[1gb]/[1.9gb], all_pools {[young] [53.5mb]->[451.9kb]/[66.5mb]}{[survivor] [7.7mb]->[5.4mb]/[8.3mb]}{[old] [1gb]->[1gb]/[1.9gb]}
[2016-05-18 02:07:07,270][WARN ][monitor.jvm ] [i-0953799xxxxxxxxx] [gc][young][104][21] duration [1.6m], collections [1]/[1.6m], total [1.6m]/[16.1m], memory [1gb]->[1gb]/[1.9gb], all_pools {[young] [451.9kb]->[957.8kb]/[66.5mb]}{[survivor] [5.4mb]->[8.3mb]/[8.3mb]}{[old] [1gb]->[1gb]/[1.9gb]}
[2016-05-18 02:08:23,267][WARN ][monitor.jvm ] [i-0953799xxxxxxxxx] [gc][young][105][22] duration [1.2m], collections [1]/[1.2m], total [1.2m]/[17.3m], memory [1gb]->[1gb]/[1.9gb], all_pools {[young] [957.8kb]->[348.1kb]/[66.5mb]}{[survivor] [8.3mb]->[6mb]/[8.3mb]}{[old] [1gb]->[1gb]/[1.9gb]}
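To make sure I'm reading these lines right, I threw together a quick parser (the regexes are my own, based only on the format above):

```python
import re

# One of the monitor.jvm lines from above (truncated to the parts parsed here).
line = ("[gc][young][103][20] duration [1.5m], collections [1]/[1.5m], "
        "total [1.5m]/[14.4m], memory [1gb]->[1gb]/[1.9gb], "
        "all_pools {[young] [53.5mb]->[451.9kb]/[66.5mb]}"
        "{[survivor] [7.7mb]->[5.4mb]/[8.3mb]}{[old] [1gb]->[1gb]/[1.9gb]}")

# "1.5m" is minutes, not milliseconds: this one collection paused for 90 s.
value, unit = re.search(r"duration \[([\d.]+)(ms|s|m)\]", line).groups()
seconds = float(value) * {"m": 60, "s": 1, "ms": 0.001}[unit]

# Old gen before -> after / max: it stays at 1gb either side of the
# collection, i.e. the GC reclaims essentially nothing from the old pool.
before, after, maximum = re.search(
    r"\[old\] \[([^\]]+)\]->\[([^\]]+)\]/\[([^\]]+)\]", line).groups()
print(seconds, before, after, maximum)
```

So the pattern in all three lines is the same: a minute-plus pause that leaves the old gen exactly where it started.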

During this time the JVM comes to a halt and effectively cannot be contacted by its peer, so the cluster splits in two. The process doesn't exit on SIGTERM, so it takes a SIGKILL and a restart, after which the cluster generally recovers to GREEN. And yes, that duration says "1.2m" - 72 seconds??!

This has been happening since ES 2.3.0, and it's the same with ES 2.3.2. The indexes I have are pretty simple: no geo points, no ngrams being extracted, just fields, a timestamp, and the odd message with a numeric field. The volume is low (it's a dev environment, and for 16 hours a day there are no messages). The issue can strike anywhere from several seconds after the Elasticsearch process starts to several hours or days later.

Config is simple:

node.name: i-06864363xxxxxxxxxxx
path.data: /opt/elasticsearch/dev
path.logs: /var/log/elasticsearch
network.host:
  - _local_
  - _site_
cloud.aws.region: ap-southeast-2
cloud.node.auto_attributes: true
discovery.type: ec2
discovery.ec2.tag.Component: ElasticSearch
discovery.ec2.tag.Environment: dev
index.mapper.dynamic: true
indices.fielddata.cache.size: 80%
indices.store.throttle.max_bytes_per_sec: 200mb
action.disable_delete_all_indices: true
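One suspect in the config above is `indices.fielddata.cache.size: 80%`. ES 2.x starts the JVM with `-XX:CMSInitiatingOccupancyFraction=75` by default, so the fielddata cache is allowed to grow past the point where CMS starts collecting; back-of-envelope arithmetic with this setup's numbers:

```python
heap_mb = 2 * 1024                    # ES_HEAP_SIZE=2g
fielddata_limit_mb = 0.80 * heap_mb   # indices.fielddata.cache.size: 80%
cms_trigger_mb = 0.75 * heap_mb       # default CMSInitiatingOccupancyFraction=75

# The cache ceiling sits above the CMS trigger point, so nothing is
# evicted before the old gen is already under constant GC pressure.
print(fielddata_limit_mb, cms_trigger_mb, fielddata_limit_mb > cms_trigger_mb)
```

I'm not certain fielddata is what's actually filling the old gen here, but the two thresholds being inverted like this looks wrong.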

Template map is:

{"order":0,"template":"-","settings":{},"mappings":{"varlogmessages":{"
properties":{"path":{"index":"not_analyzed","type":"string"},"component":{"index
":"not_analyzed","type":"string"},"instance_id":{"index":"not_analyzed","type":"
string"},"release":{"index":"not_analyzed","type":"string"},"@version":{"type":"
string"},"host":{"index":"not_analyzed","type":"string"},"message":{"type":"string"},"type":{"type":"string"},"timestamp":{"index":"not_analyzed","type":"string"}}},"server":{"properties":{"severity":{"index":"not_analyzed","type":"string"},"release":{"index":"not_analyzed","type":"string"},"module":{"index":"not_analyzed","type":"string"},"thread":{"type":"string"},"message":{"type":"string"},"type":{"type":"string"},"tags":{"type":"string"},"path":{"index":"not_analyzed","type":"string"},"component":{"index":"not_analyzed","type":"string"},"instance_id":{"index":"not_analyzed","type":"string"},"@version":{"type":"string"},"host":{"index":"not_analyzed","type":"string"},"correlation_id":{"index":"not_analyzed","type":"string"},"time_elapsed_ms":{"type":"integer"},"class":{"index":"not_analyzed","type":"string"},"user":{"index":"not_analyzed","type":"string"},"timestamp":{"index":"not_analyzed","type":"string"}}},"jvmgclog":{"properties":{"fullheapsize":{"type":"long"},"release":{"index":"not_analyzed","type":"string"},"description":{"type":"string"},"message":{"type":"string"},"type":{"type":"string"},"duration":{"type":"double"},"path":{"index":"not_analyzed","type":"string"},"component":{"index":"not_analyzed","type":"string"},"instance_id":{"index":"not_analyzed","type":"string"},"heapuseaftergc":{"type":"long"},"@version":{"type":"string"},"host":{"index":"not_analyzed","type":"string"},"heapusebeforegc":{"type":"long"},"timestamp":{"index":"not_analyzed","type":"string"}}}},"aliases":{}}

There's only about 500 MB of data in here. Does anyone know why the old gen fills up so quickly and then GCs itself out of service?

Thanks in advance.

More info: the Java version is:
openjdk version "1.8.0_91"
OpenJDK Runtime Environment (build 1.8.0_91-b14)
OpenJDK 64-Bit Server VM (build 25.91-b14, mixed mode)