Frequent GC brings down cluster without obvious load

Hi experts,
We have a cluster with 10 nodes which used to work very well. Recently the cluster became unstable. After some debugging, we found that the JVM can sometimes go into an unstable state without any obvious load.
JVM heap usage goes up very quickly, triggers heavy GC, and then drops back again. This cycle continues until, at some point, the node hangs completely and causes shard rebalancing. However, sometimes the node "comes back alive" for a few seconds, causing another round of shard rebalancing. The rebalancing continues until it either makes the whole cluster unresponsive or calms down once one or two nodes hit a JVM OOM.
During this time there is no obvious high workload and no slow query is logged, only lots and lots of slow GC logs.

Below are some screenshots from bigdesk. I'm not sure what I should paste here for the experts, so please just let me know if you want any additional information.


Hi @niulin,

I've been experiencing such issues and have been conversing with the community for help.
You may benefit by looking/following this thread.

We've been running into a similar issue recently -- and I don't think it has to do with indexing data as @Srinath_C's thread link suggests.

In our case we're running 4 x c3.8xlarge instances with 10TB of SSD-backed provisioned-iops volumes attached to each host (8000 IOPS per host). We index upwards of ~20k events per second peak during the day, and our indexed documents range quite a bit in field types, counts, etc. We have 331 indexed fields in our current day of events, for example.

(We store 30 days of data, at ~300GB/day for a total of ~9-12TB of used storage and ~9 billion documents)

Our hosts have 60+GB of memory, and we allocate 29.5GB of that to the heap at startup, leaving ~30GB for Lucene indexing and other local processes.

/usr/lib/jvm/java-7-oracle/bin/java -Xms29696m -Xmx29696m -Djava.awt.headless=true -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -XX:+DisableExplicitGC

Our elasticsearch.yml file:

  mlockall: true
    access_key: XXX
    region: us-east-1
    secret_key: XXX
    auto_attributes: true
  name: flume-elasticsearch-production_vpc-useast1
      allow_rebalance: indices_all_active
      cluster_concurrent_rebalance: 2
      node_concurrent_recoveries: 8
        host: true
    any_group: false
          - default
          - OPS3-FLUME-ES
    host_type: private_ip
  type: ec2
        enabled: false
        enabled: false
  recover_after_time: 5m
  number_of_replicas: 1
  number_of_shards: 10
      limit: 85%
      size: 25%
    max_bytes_per_sec: -1
      max_bytes_per_sec: -1
    flush_threshold_size: 1024mb
  name: prod-flume-es-useast1-109-i-XXX-flume-elasticsearch-production_vpc-useast1
  data: /mnt/elasticsearch/flume-elasticsearch-production_vpc-useast1

In our case, we've had the cluster crash now at least 5 times. Each time it crashes we've observed that one or more of our nodes go into a JVM garbage collection loop. When that happens, the cluster management code loses connectivity to the various nodes, new document indexing starts to fail, and we effectively fall over.

We think we've finally narrowed this down to the indices.fielddata.cache.size setting. We originally had it at 75%. We tried dropping it down to 50% and still had a failure yesterday. We're now running at 25%, as you can see in the config above, and we're hoping that holds.
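As a rough sanity check on those three settings, here is a small Python sketch of the heap arithmetic. It only uses the 29696m heap and the CMSInitiatingOccupancyFraction=75 flag from the command line above. One loud assumption: the CMS threshold is compared against the whole heap here, while the JVM actually measures it against the old generation (which is smaller), so treat the result as an approximation, not a diagnosis.

```python
# Heap size taken from the -Xms/-Xmx flags quoted above (MiB).
HEAP_MIB = 29696

# indices.fielddata.cache.size values tried so far, as fractions of the heap.
CACHE_FRACTIONS = [0.75, 0.50, 0.25]

# -XX:CMSInitiatingOccupancyFraction=75 from the same command line.
# ASSUMPTION: applied to the whole heap for a rough comparison; the JVM
# measures this against the old generation only.
CMS_FRACTION = 0.75


def fielddata_budget_mib(heap_mib, fraction):
    """Upper bound on the fielddata cache for a given cache.size fraction."""
    return heap_mib * fraction


def summarize(heap_mib=HEAP_MIB):
    """One row per cache setting: cache budget, what is left, CMS comparison."""
    cms_trigger = heap_mib * CMS_FRACTION
    rows = []
    for fraction in CACHE_FRACTIONS:
        cache = fielddata_budget_mib(heap_mib, fraction)
        rows.append({
            "cache_fraction": fraction,
            "cache_mib": cache,
            "remaining_mib": heap_mib - cache,
            "cache_alone_reaches_cms_trigger": cache >= cms_trigger,
        })
    return rows


if __name__ == "__main__":
    for row in summarize():
        print(row)
```

Under that approximation, a cache allowed to grow to 75% of the heap can reach the CMS trigger point on its own, which would be consistent with back-to-back collections that free very little. At 50% and 25% the cache alone stays below it, so whatever else lives on the heap matters more.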

Our theory at this point is that we have no problem indexing documents, but because our cluster is running combined data and master nodes, all hell breaks loose when the JVM GC gets involved. We think that users pull up a dashboard in Kibana (we make heavy use of structured logs), the search indexes a ton of field data (which is allowed by the large indices.fielddata.cache.size setting), and we ultimately force the instance into a garbage collection loop.

The questions we haven't answered yet are:

  • What else is using mass amounts of memory out of the HEAP? If we have 29GB available and we allocate 14.5GB to field data, what is using the other 14.5GB?

  • We've seen full cluster failures (effectively access to the cluster API hangs, indexing freezes, etc) even when only a single node in the cluster begins the GC loop. Why? If we move to dedicated master nodes (planned for next week) and a data node starts into its GC loop, will the cluster continue to operate properly?

  • Why is the indices.fielddata.cache.expires setting experimental and marked for deprecation? It seems wildly helpful to clear out this cache of expired documents and data whenever appropriate? Am I missing something here?
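For the first question, one place to start might be comparing fielddata size against total heap usage per node. The Python sketch below takes an already-parsed nodes stats response and splits used heap into "fielddata" and "everything else". The field names (heap_used_in_bytes, heap_max_in_bytes, memory_size_in_bytes) are how I remember the nodes stats output, so please verify them against your version. The sample data is entirely hypothetical and only there so the function can be run without a cluster.

```python
GIB = 1024 ** 3


def heap_breakdown(nodes_stats):
    """Split each node's used heap into fielddata and everything else.

    Expects a dict shaped like a parsed nodes stats response (assumed layout):
    nodes -> <id> -> jvm.mem.* and indices.fielddata.memory_size_in_bytes.
    Returns rows sorted by heap usage, busiest node first.
    """
    rows = []
    for node in nodes_stats["nodes"].values():
        heap_used = node["jvm"]["mem"]["heap_used_in_bytes"]
        heap_max = node["jvm"]["mem"]["heap_max_in_bytes"]
        fielddata = node["indices"]["fielddata"]["memory_size_in_bytes"]
        rows.append({
            "name": node["name"],
            "heap_used_gib": heap_used / GIB,
            "fielddata_gib": fielddata / GIB,
            "other_gib": (heap_used - fielddata) / GIB,
            "heap_used_pct": 100.0 * heap_used / heap_max,
        })
    return sorted(rows, key=lambda r: r["heap_used_pct"], reverse=True)


# HYPOTHETICAL sample, not measurements from any cluster in this thread.
SAMPLE = {
    "nodes": {
        "id-a": {
            "name": "node-a",
            "jvm": {"mem": {"heap_used_in_bytes": 24 * GIB,
                            "heap_max_in_bytes": 29 * GIB}},
            "indices": {"fielddata": {"memory_size_in_bytes": 7 * GIB}},
        },
        "id-b": {
            "name": "node-b",
            "jvm": {"mem": {"heap_used_in_bytes": 12 * GIB,
                            "heap_max_in_bytes": 29 * GIB}},
            "indices": {"fielddata": {"memory_size_in_bytes": 3 * GIB}},
        },
    }
}

if __name__ == "__main__":
    for row in heap_breakdown(SAMPLE):
        print(row)
```

If "other_gib" stays large while fielddata is small, the pressure is coming from somewhere other than the fielddata cache, and lowering indices.fielddata.cache.size further is unlikely to help much on its own.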

Can you provide more details on your cluster; how many nodes, what version?

It might be better to create your own thread; there's a fair bit there and we'd be hijacking the OP's question :)

We have 10 servers, each with 8 cores and 64G memory. Each server runs one ES node with 30G memory allocated and swapping disabled.
We are running ES 1.6.0 with JVM 1.7.0_79, and all other options are left at their defaults. More specifically, the process looks like this:

/data/elastic/java/jdk1.7.0_79/bin/java -Xms30g -Xmx30g -Djava.awt.headless=true -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -XX:+DisableExplicitGC -Dfile.encoding=UTF-8 -Delasticsearch -Des.pidfile=run/ -Des.path.home=/data/elastic -cp :/data/elastic/lib/elasticsearch-1.6.0.jar:/data/elastic/lib/:/data/elastic/lib/sigar/ org.elasticsearch.bootstrap.Elasticsearch

We have 3 indices running:
index 1: 10 primary shards, x2 replica shards, 235M docs, 28G data x3
index 2: 5 primary shards, x2 replica shards, 17M docs, 7G data x3
index 3: 9 primary shards, x2 replica shards, 340M docs, 29G data x3
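To put those numbers in per-node terms, here is a quick Python sketch of the arithmetic. It assumes "x2 replica shards" means two replicas per primary and "28G data x3" means 28G per copy with three copies in total. Both readings are assumptions about the notation above, as is the perfectly even spread across nodes.

```python
NODES = 10

# (name, primary shards, replicas per primary, data size per copy in GB)
# ASSUMPTION: "x2 replica shards" = 2 replicas, "NNG data x3" = NN GB per copy.
INDICES = [
    ("index 1", 10, 2, 28),
    ("index 2", 5, 2, 7),
    ("index 3", 9, 2, 29),
]


def cluster_totals(indices=INDICES, nodes=NODES):
    """Total shard copies and data, plus the per-node average if evenly spread."""
    shard_copies = sum(primaries * (1 + replicas)
                       for _, primaries, replicas, _ in indices)
    total_gb = sum(size_gb * (1 + replicas)
                   for _, _, replicas, size_gb in indices)
    return {
        "shard_copies": shard_copies,
        "total_gb": total_gb,
        "shards_per_node": shard_copies / nodes,
        "gb_per_node": total_gb / nodes,
    }


if __name__ == "__main__":
    print(cluster_totals())
```

Under those assumptions that comes to 72 shard copies and 192G of data, or roughly 7 shards and 19G per node on average, against a 30G heap per node.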

What sort of data structure is it, and what do the queries look like?

@niulin @diranged We've seen a significant improvement in stability with dedicated masters. There are some more GC parameters that we've been experimenting with; I will post back once the tests are done.