ES 2.3.x: jvm.monitor log shows GC taking > 1min, cluster breaks

I'm running a very basic 2-node ES+Kibana cluster on AWS EC2 in a dev environment, with HEAP_SIZE set to 2 GB (on an m3.medium instance, which has 3.75 GB RAM in total). I'm seeing a node start crawling and spewing log messages like:
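For completeness, this is how the heap is set (Debian-style packaging; the exact file path depends on how ES was installed, so treat it as an assumption):

```shell
# /etc/default/elasticsearch (Debian/Ubuntu package layout; path may differ)
# 2g heap on a 3.75 GB m3.medium leaves ~1.75 GB for the OS and page cache
ES_HEAP_SIZE=2g
```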

[2016-05-18 02:05:26,071][WARN ][monitor.jvm ] [i-0953799xxxxxxxxx] [gc][young][103][20] duration [1.5m], collections [1]/[1.5m], total [1.5m]/[14.4m], memory [1gb]->[1gb]/[1.9gb], all_pools {[young] [53.5mb]->[451.9kb]/[66.5mb]}{[survivor] [7.7mb]->[5.4mb]/[8.3mb]}{[old] [1gb]->[1gb]/[1.9gb]}
[2016-05-18 02:07:07,270][WARN ][monitor.jvm ] [i-0953799xxxxxxxxx] [gc][young][104][21] duration [1.6m], collections [1]/[1.6m], total [1.6m]/[16.1m], memory [1gb]->[1gb]/[1.9gb], all_pools {[young] [451.9kb]->[957.8kb]/[66.5mb]}{[survivor] [5.4mb]->[8.3mb]/[8.3mb]}{[old] [1gb]->[1gb]/[1.9gb]}
[2016-05-18 02:08:23,267][WARN ][monitor.jvm ] [i-0953799xxxxxxxxx] [gc][young][105][22] duration [1.2m], collections [1]/[1.2m], total [1.2m]/[17.3m], memory [1gb]->[1gb]/[1.9gb], all_pools {[young] [957.8kb]->[348.1kb]/[66.5mb]}{[survivor] [8.3mb]->[6mb]/[8.3mb]}{[old] [1gb]->[1gb]/[1.9gb]}
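To make sure I'm reading these lines right, I threw together a quick parser (the regexes are my own, based only on the format above):

```python
import re

# One of the monitor.jvm lines from above (truncated to the parts parsed here).
line = ("[gc][young][103][20] duration [1.5m], collections [1]/[1.5m], "
        "total [1.5m]/[14.4m], memory [1gb]->[1gb]/[1.9gb], "
        "all_pools {[young] [53.5mb]->[451.9kb]/[66.5mb]}"
        "{[survivor] [7.7mb]->[5.4mb]/[8.3mb]}{[old] [1gb]->[1gb]/[1.9gb]}")

# "1.5m" is minutes, not milliseconds: this one collection paused for 90 s.
value, unit = re.search(r"duration \[([\d.]+)(ms|s|m)\]", line).groups()
seconds = float(value) * {"m": 60, "s": 1, "ms": 0.001}[unit]

# Old gen before -> after / max: it stays at 1gb either side of the
# collection, i.e. the GC reclaims essentially nothing from the old pool.
before, after, maximum = re.search(
    r"\[old\] \[([^\]]+)\]->\[([^\]]+)\]/\[([^\]]+)\]", line).groups()
print(seconds, before, after, maximum)
```

So the pattern in all three lines is the same: a minute-plus pause that leaves the old gen exactly where it started.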

During this time the JVM comes to a halt and effectively cannot be contacted by its peer, so the cluster splits in two. The process doesn't exit on SIGTERM, so it takes a SIGKILL and a restart, after which the cluster generally recovers to GREEN. And yes, that duration says "1.2m" - 72 seconds??!

This has been happening since ES 2.3.0, and it's the same with ES 2.3.2. The indexes I have are pretty simple: no geo points, no ngrams being extracted, just fields, a timestamp, and the odd message with a numeric field. The volume is low (it's a dev environment, and for 16 hours a day there are no messages). The issue can strike anywhere from several seconds after the Elasticsearch process starts to several hours or days later.

Config is simple:

node.name: i-06864363xxxxxxxxxxx
path.data: /opt/elasticsearch/dev
path.logs: /var/log/elasticsearch
network.host:
  - _local_
  - _site_
cloud.aws.region: ap-southeast-2
cloud.node.auto_attributes: true
discovery.type: ec2
discovery.ec2.tag.Component: ElasticSearch
discovery.ec2.tag.Environment: dev
index.mapper.dynamic: true
indices.fielddata.cache.size: 80%
indices.store.throttle.max_bytes_per_sec: 200mb
action.disable_delete_all_indices: true
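One suspect in the config above is `indices.fielddata.cache.size: 80%`. ES 2.x starts the JVM with `-XX:CMSInitiatingOccupancyFraction=75` by default, so the fielddata cache is allowed to grow past the point where CMS starts collecting; back-of-envelope arithmetic with this setup's numbers:

```python
heap_mb = 2 * 1024                    # ES_HEAP_SIZE=2g
fielddata_limit_mb = 0.80 * heap_mb   # indices.fielddata.cache.size: 80%
cms_trigger_mb = 0.75 * heap_mb       # default CMSInitiatingOccupancyFraction=75

# The cache ceiling sits above the CMS trigger point, so nothing is
# evicted before the old gen is already under constant GC pressure.
print(fielddata_limit_mb, cms_trigger_mb, fielddata_limit_mb > cms_trigger_mb)
```

I'm not certain fielddata is what's actually filling the old gen here, but the two thresholds being inverted like this looks wrong.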

Template map is:

{"order":0,"template":"-","settings":{},"mappings":{"varlogmessages":{"
properties":{"path":{"index":"not_analyzed","type":"string"},"component":{"index
":"not_analyzed","type":"string"},"instance_id":{"index":"not_analyzed","type":"
string"},"release":{"index":"not_analyzed","type":"string"},"@version":{"type":"
string"},"host":{"index":"not_analyzed","type":"string"},"message":{"type":"string"},"type":{"type":"string"},"timestamp":{"index":"not_analyzed","type":"string"}}},"server":{"properties":{"severity":{"index":"not_analyzed","type":"string"},"release":{"index":"not_analyzed","type":"string"},"module":{"index":"not_analyzed","type":"string"},"thread":{"type":"string"},"message":{"type":"string"},"type":{"type":"string"},"tags":{"type":"string"},"path":{"index":"not_analyzed","type":"string"},"component":{"index":"not_analyzed","type":"string"},"instance_id":{"index":"not_analyzed","type":"string"},"@version":{"type":"string"},"host":{"index":"not_analyzed","type":"string"},"correlation_id":{"index":"not_analyzed","type":"string"},"time_elapsed_ms":{"type":"integer"},"class":{"index":"not_analyzed","type":"string"},"user":{"index":"not_analyzed","type":"string"},"timestamp":{"index":"not_analyzed","type":"string"}}},"jvmgclog":{"properties":{"fullheapsize":{"type":"long"},"release":{"index":"not_analyzed","type":"string"},"description":{"type":"string"},"message":{"type":"string"},"type":{"type":"string"},"duration":{"type":"double"},"path":{"index":"not_analyzed","type":"string"},"component":{"index":"not_analyzed","type":"string"},"instance_id":{"index":"not_analyzed","type":"string"},"heapuseaftergc":{"type":"long"},"@version":{"type":"string"},"host":{"index":"not_analyzed","type":"string"},"heapusebeforegc":{"type":"long"},"timestamp":{"index":"not_analyzed","type":"string"}}}},"aliases":{}}

There's only about 500 MB of data in here. Does anyone know why the old gen fills up so quickly and then GCs itself out of service?

Thanks in advance.

More info: the Java version is:
openjdk version "1.8.0_91"
OpenJDK Runtime Environment (build 1.8.0_91-b14)
OpenJDK 64-Bit Server VM (build 25.91-b14, mixed mode)