I have a 5-node ES cluster with 32 GB RAM on each node, and I assign 20 GB of that to the ES process heap. These are the relevant fields in my yml:
discovery.zen.ping.timeout: 60s (raised this, as 10s was not enough)
Elasticsearch Version: 1.3.1
I index anywhere between 500 and 1000 documents per minute (structured much like tweets and other social-network data). The cluster holds 406 million documents (excluding replicas) and 800 GB of data (including replicas).
Recently I observed the heap growing continuously until, eventually, GC pauses or OOM take nodes down. I figured this is more a problem with indexing than with querying, as the field data cache and filter cache never exceed 3 GB combined.
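For scale: the ingest rate and the total document count are consistent with roughly one to one-and-a-half years of data. A back-of-the-envelope check, assuming a steady rate (real traffic of course varies):

```python
# How long does it take to accumulate 406M docs at 500-1000 docs/minute?
# (Assumes a constant ingest rate; purely illustrative.)
total_docs = 406_000_000
for rate_per_min in (500, 1000):
    minutes = total_docs / rate_per_min
    days = minutes / (60 * 24)
    print(f"{rate_per_min}/min -> about {days:,.0f} days")
```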
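This is how I ruled the caches out: I pull the fielddata and filter cache sizes out of the node stats response and compare them to used heap. A minimal sketch (the field names follow the 1.x `_nodes/stats` format, but the sample payload and numbers below are invented for illustration):

```python
# Sketch: what share of used heap do the fielddata + filter caches explain?
# Field names match the ES 1.x _nodes/stats layout; sample values are made up.
GB = 1024 ** 3

sample_stats = {
    "nodes": {
        "node1": {
            "indices": {
                "fielddata": {"memory_size_in_bytes": 2 * GB},
                "filter_cache": {"memory_size_in_bytes": 1 * GB},
            },
            "jvm": {
                "mem": {
                    "heap_used_in_bytes": 18 * GB,
                    "heap_max_in_bytes": 20 * GB,
                }
            },
        }
    }
}

def cache_share_of_heap(stats):
    """Return {node_id: fraction of used heap held by the two caches}."""
    out = {}
    for node_id, node in stats["nodes"].items():
        caches = (node["indices"]["fielddata"]["memory_size_in_bytes"]
                  + node["indices"]["filter_cache"]["memory_size_in_bytes"])
        used = node["jvm"]["mem"]["heap_used_in_bytes"]
        out[node_id] = caches / used
    return out

print(cache_share_of_heap(sample_stats))  # caches explain only ~17% of used heap here
```

With numbers like my cluster's (3 GB of caches against ~18 GB of used heap), the caches clearly cannot account for the heap growth.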
This is the current cluster health
I would like to know where I need to make improvements. Should I increase the RAM to 64 GB per node, or look at other options? I am also considering enabling doc_values and upgrading ES to the latest version, but I would like to understand the root cause of this behaviour before taking any action.
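For reference, this is roughly the mapping change I have in mind for doc_values, which moves fielddata off-heap. The field name `user_id` here is just an example; in 1.x, doc_values only applies to `not_analyzed` strings and numeric/date fields, and it takes effect only for newly created indices:

```json
{
  "tweet": {
    "properties": {
      "user_id": {
        "type": "string",
        "index": "not_analyzed",
        "doc_values": true
      }
    }
  }
}
```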
This is the hot threads output https://gist.github.com/naryad/abe852c04dbac5e5611a
This is the output of node stats API https://gist.github.com/naryad/06ec0e17c0c02e311e80
The heap fills slowly with old-generation objects, and when GC runs, almost none of those old-generation objects are reclaimed. Old-generation objects account for up to 90% of the 20 GB heap allocated to ES.
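This 90% figure comes from the `jvm` section of node stats. A small sketch of the calculation, assuming the 1.x `jvm.mem.pools` layout (the numbers below are invented to mirror what I see):

```python
# Sketch: old-generation occupancy as a fraction of total heap, from the
# jvm section of _nodes/stats. Pool names follow the ES 1.x / HotSpot
# layout; sample values are invented.
GB = 1024 ** 3

jvm_stats = {
    "mem": {
        "heap_max_in_bytes": 20 * GB,
        "pools": {
            "young":    {"used_in_bytes": 1 * GB,        "max_in_bytes": 2 * GB},
            "survivor": {"used_in_bytes": 100 * 1024**2, "max_in_bytes": 256 * 1024**2},
            "old":      {"used_in_bytes": 18 * GB,       "max_in_bytes": 18 * GB},
        },
    }
}

def old_gen_share(jvm):
    """Fraction of the total heap occupied by old-generation objects."""
    old_used = jvm["mem"]["pools"]["old"]["used_in_bytes"]
    heap_max = jvm["mem"]["heap_max_in_bytes"]
    return old_used / heap_max

print(old_gen_share(jvm_stats))  # ~0.9, i.e. old gen holds ~90% of the 20 GB heap
```

When the old pool's `used_in_bytes` stays near its `max_in_bytes` even right after a full GC, the objects are live (still referenced), which is what points me at indexing-side structures rather than caches.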