I'm running a very basic two-node ES + Kibana cluster on AWS EC2 in a dev environment. I'm setting HEAP_SIZE to 2 GB (on an m3.medium instance, which has 3.75 GB of RAM in total). I'm seeing a node start to crawl and spew log messages like:
[2016-05-18 02:05:26,071][WARN ][monitor.jvm ] [i-0953799xxxxxxxxx] [gc][young][103][20] duration [1.5m], collections [1]/[1.5m], total [1.5m]/[14.4m], memory [1gb]->[1gb]/[1.9gb], all_pools {[young] [53.5mb]->[451.9kb]/[66.5mb]}{[survivor] [7.7mb]->[5.4mb]/[8.3mb]}{[old] [1gb]->[1gb]/[1.9gb]}
[2016-05-18 02:07:07,270][WARN ][monitor.jvm ] [i-0953799xxxxxxxxx] [gc][young][104][21] duration [1.6m], collections [1]/[1.6m], total [1.6m]/[16.1m], memory [1gb]->[1gb]/[1.9gb], all_pools {[young] [451.9kb]->[957.8kb]/[66.5mb]}{[survivor] [5.4mb]->[8.3mb]/[8.3mb]}{[old] [1gb]->[1gb]/[1.9gb]}
[2016-05-18 02:08:23,267][WARN ][monitor.jvm ] [i-0953799xxxxxxxxx] [gc][young][105][22] duration [1.2m], collections [1]/[1.2m], total [1.2m]/[17.3m], memory [1gb]->[1gb]/[1.9gb], all_pools {[young] [957.8kb]->[348.1kb]/[66.5mb]}{[survivor] [8.3mb]->[6mb]/[8.3mb]}{[old] [1gb]->[1gb]/[1.9gb]}
During this time, the JVM grinds to a halt and effectively cannot be contacted by its peer, so the cluster splits in two. Sending a SIGTERM to the JVM doesn't make it exit, so it's a SIGKILL and restart, after which the cluster generally recovers to GREEN. Yes, that duration really says "1.2m": 72 seconds for a single collection!
This has been happening since ES 2.3.0 and is the same with ES 2.3.2. The indexes I have are pretty simple: no geo points, no ngrams being extracted, just plain fields, a timestamp, and the odd message with a numeric field. The volume is low (it's a dev environment, and for 16 hours a day there are no messages at all). The issue can occur anywhere from a few seconds after the Elasticsearch process starts to several hours or days later.
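For completeness, the heap is set via the ES_HEAP_SIZE environment variable that the 2.x startup scripts read. The file below is the DEB package's defaults file, so treat the exact path as an assumption if your layout differs:

# /etc/default/elasticsearch (path assumed from the DEB package layout)
ES_HEAP_SIZE=2g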
Config is simple:
node.name: i-06864363xxxxxxxxxxx
path.data: /opt/elasticsearch/dev
path.logs: /var/log/elasticsearch
network.host:
- local
- site
cloud.aws.region: ap-southeast-2
cloud.node.auto_attributes: true
discovery.type: ec2
discovery.ec2.tag.Component: Elasticsearch
discovery.ec2.tag.Environment: dev
index.mapper.dynamic: true
indices.fielddata.cache.size: 80%
indices.store.throttle.max_bytes_per_sec: 200mb
action.disable_delete_all_indices: true
Template map is:
{"order":0,"template":"-","settings":{},"mappings":{"varlogmessages":{"
properties":{"path":{"index":"not_analyzed","type":"string"},"component":{"index
":"not_analyzed","type":"string"},"instance_id":{"index":"not_analyzed","type":"
string"},"release":{"index":"not_analyzed","type":"string"},"@version":{"type":"
string"},"host":{"index":"not_analyzed","type":"string"},"message":{"type":"string"},"type":{"type":"string"},"timestamp":{"index":"not_analyzed","type":"string"}}},"server":{"properties":{"severity":{"index":"not_analyzed","type":"string"},"release":{"index":"not_analyzed","type":"string"},"module":{"index":"not_analyzed","type":"string"},"thread":{"type":"string"},"message":{"type":"string"},"type":{"type":"string"},"tags":{"type":"string"},"path":{"index":"not_analyzed","type":"string"},"component":{"index":"not_analyzed","type":"string"},"instance_id":{"index":"not_analyzed","type":"string"},"@version":{"type":"string"},"host":{"index":"not_analyzed","type":"string"},"correlation_id":{"index":"not_analyzed","type":"string"},"time_elapsed_ms":{"type":"integer"},"class":{"index":"not_analyzed","type":"string"},"user":{"index":"not_analyzed","type":"string"},"timestamp":{"index":"not_analyzed","type":"string"}}},"jvmgclog":{"properties":{"fullheapsize":{"type":"long"},"release":{"index":"not_analyzed","type":"string"},"description":{"type":"string"},"message":{"type":"string"},"type":{"type":"string"},"duration":{"type":"double"},"path":{"index":"not_analyzed","type":"string"},"component":{"index":"not_analyzed","type":"string"},"instance_id":{"index":"not_analyzed","type":"string"},"heapuseaftergc":{"type":"long"},"@version":{"type":"string"},"host":{"index":"not_analyzed","type":"string"},"heapusebeforegc":{"type":"long"},"timestamp":{"index":"not_analyzed","type":"string"}}}},"aliases":{}}
There's only 500 MB of data in here. Does anyone know why the old generation fills up so quickly and the node then GCs itself out of service?
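In case it helps with diagnosis, this is roughly how I can watch per-node heap and fielddata growth via the nodes stats API. It's only a sketch, and it assumes Elasticsearch is reachable on localhost:9200 with no authentication:

# Sketch: poll per-node JVM heap and fielddata usage so old-gen growth can be
# correlated with fielddata. Assumes ES on localhost:9200, no auth (adjust URL).
import json
import urllib.request

def get_json(path):
    with urllib.request.urlopen("http://localhost:9200" + path) as resp:
        return json.loads(resp.read().decode("utf-8"))

stats = get_json("/_nodes/stats/jvm,indices")
for node in stats["nodes"].values():
    heap = node["jvm"]["mem"]
    fielddata = node["indices"]["fielddata"]
    print(node["name"],
          "heap_used_percent=" + str(heap["heap_used_percent"]),
          "old_gen_used_bytes=" + str(heap["pools"]["old"]["used_in_bytes"]),
          "fielddata_bytes=" + str(fielddata["memory_size_in_bytes"]))

If the fielddata number turns out to track the old-gen number upwards, that would at least narrow things down, given the indices.fielddata.cache.size: 80% setting in the config above.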
Thanks in advance.