We've been running into a similar issue recently -- and I don't think it has to do with indexing data as @Srinath_C's thread link suggests.
In our case we're running
4 x c3.8xlarge instances with 10TB of SSD-backed provisioned-iops volumes attached to each host (8000 IOPS per host). We index upwards of ~20k events per second peak during the day, and our indexed documents range quite a bit in field types, counts, etc. We have 331 indexed fields in our current day of events, for example.
(We store 30 days of data, at ~300GB/day for a total of ~9-12TB of used storage and ~9 billion documents)
Our hosts have 60+GB of memory, and we allocate 29.5GB of that to the HEAP on-startup, leaving ~30GB for lucene indexing and other local processes.
/usr/lib/jvm/java-7-oracle/bin/java -Xms29696m -Xmx29696m -Djava.awt.headless=true -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -XX:+DisableExplicitGC
Our elasticsearch.yml file:
In our case, we've had the cluster crash now at least 5 times. Each time it crashes we've observed that one or more of our nodes go into a JVM garbage collection loop. When that happens, the cluster management code loses connectivity to the various nodes, new document indexing starts to fail, and we effectively fall over.
We've finally narrowed this down we think to the
indices.fielddata.cache.size setting. We had originally it at
75%. We tried dropping it down to
50% and still had a failure yesterday. We're now running at
25% as you can see in the config above and we're hoping everything works.
Our theory at this point is that we have no problem indexing documents, but because our cluster is running combined
when the JVM GC gets involved all hell breaks loose. We think that users pull up a dashboard in Kibana (we make heavy use of structured logs), the search indexes a ton of field data (which is allowed by the large
indices.fielddata.cache.size setting), and we ultimately force the instance into a garbage collection loop.
The questions we havn't answered yet are..
What else is using mass amounts of memory out of the HEAP? If we have 29GB available and we allocate 14.5GB to
field data, what is using the other 14.5GB?
We've seen full cluster failures (effectively access to the cluster API hangs, indexing freezes, etc) even when only a single node in the cluster begins the GC loop. Why? If we move to dedicated
master nodes (planned for next week) and a
data node starts into its GC loop, will the cluster continue to operate properly?
Why is the
indices.fielddata.cache.expires setting experimental and marked for deprecation? It seems wildly helpful to clear out this cache of expired documents and data whenever appropriate? Am I missing something here?