GC Pauses and OOM errors when indexing into an 800 GB cluster

(naryad) #1

I have a 5-node ES cluster with 32 GB RAM on each node, and I assign 20 GB to the ES process. These are the relevant settings in my yml.

discovery.type: ec2
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.timeout: 60s (increased this, as 10s was not enough)
discovery.zen.minimum_master_nodes: 3
script.disable_dynamic: true
bootstrap.mlockall: true
indices.fielddata.cache.size: 50%
indices.breaker.fielddata.limit: 60%
indices.breaker.request.limit: 40%
indices.breaker.total.limit: 70%
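As a quick sanity check, those percentage-based settings translate into the following absolute budgets under the 20 GB heap from this post (how the fielddata cache and the breakers interact is version-dependent, so treat these as rough upper bounds):

```python
# Approximate absolute budgets implied by the percentage settings,
# assuming the 20 GB heap described above.
heap_gb = 20

settings = {
    "indices.fielddata.cache.size (50%)": 0.50,
    "indices.breaker.fielddata.limit (60%)": 0.60,
    "indices.breaker.request.limit (40%)": 0.40,
    "indices.breaker.total.limit (70%)": 0.70,
}

for name, fraction in settings.items():
    print(f"{name}: {heap_gb * fraction:.0f} GB")
```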

Elasticsearch Version: 1.3.1

I index anywhere between 500 and 1,000 documents per minute (structured much like tweets and other social network data). The cluster holds 406 million documents (excluding replicas) and 800 GB of data (including replicas).

Recently I observed the heap continuously increasing until, in the end, GC pauses or OOM errors take nodes down. I figured this is more of a problem with indexing than with querying, as the field data cache and filter cache never exceed 3 GB combined.

This is the current cluster health:

{
  "cluster_name": "name_of_cluster",
  "status": "yellow",
  "timed_out": false,
  "number_of_nodes": 5,
  "number_of_data_nodes": 5,
  "active_primary_shards": 15,
  "active_shards": 26,
  "relocating_shards": 0,
  "initializing_shards": 4,
  "unassigned_shards": 0
}
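The yellow status is consistent with replicas still initializing rather than anything being lost. Assuming one replica per primary (not stated in the post), the shard counts add up exactly:

```python
# Sanity check on the cluster-health numbers above.
active_primary_shards = 15
active_shards = 26
initializing_shards = 4
replicas_per_primary = 1  # assumption; number_of_replicas is not shown

# Expected shard copies: each primary plus its replicas.
expected_total = active_primary_shards * (1 + replicas_per_primary)
print(expected_total)                       # 30 copies expected
print(active_shards + initializing_shards)  # 30 copies accounted for
```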

I would like to know where I need to make improvements. Should I increase the RAM to 64 GB per node, or pursue similar options? I am also considering using doc_values and upgrading ES to the latest version, but I would like to understand the root cause of this behaviour before taking any action.
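For reference, in ES 1.x doc_values are opted into per field in the mapping and only apply to not_analyzed string fields and numeric/date fields; existing data must be reindexed to benefit. A minimal sketch, with hypothetical type and field names ("tweet", "user_id" are illustrative, not from the post):

```json
{
  "mappings": {
    "tweet": {
      "properties": {
        "user_id": {
          "type": "string",
          "index": "not_analyzed",
          "doc_values": true
        }
      }
    }
  }
}
```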

This is the hot threads output https://gist.github.com/naryad/abe852c04dbac5e5611a
This is the output of node stats API https://gist.github.com/naryad/06ec0e17c0c02e311e80

The heap fills slowly with old-generation objects, and when GC runs, almost none of them are collected. Old-generation objects account for up to 90% of the 20 GB heap allocated to ES.
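The old-generation pressure can be quantified per node from the node stats output. A minimal sketch, assuming the JVM pool structure the 1.x node stats API returns (jvm.mem.pools.old) and using made-up numbers in place of the real response in the gist:

```python
# Compute old-gen heap usage per node from a node-stats-style payload.
# The byte values below are illustrative, not taken from the linked gist.
stats = {
    "nodes": {
        "node_1": {
            "jvm": {
                "mem": {
                    "pools": {
                        "old": {
                            "used_in_bytes": 18 * 1024**3,
                            "max_in_bytes": 20 * 1024**3,
                        }
                    }
                }
            }
        }
    }
}

for node_id, node in stats["nodes"].items():
    old = node["jvm"]["mem"]["pools"]["old"]
    pct = 100.0 * old["used_in_bytes"] / old["max_in_bytes"]
    print(f"{node_id}: old gen at {pct:.0f}% of its pool")  # 90%, matching the symptom
```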

(Jörg Prante) #2

Can you update to a more recent ES version?

(naryad) #3

I can do that; in fact, that is the immediate next thing I am going to do. Thanks for the reply. I just wanted to know whether I am doing something wrong, or whether I need to increase RAM to 64 GB, add more nodes, or tune any settings.

(naryad) #4

Looks like the upgrade has had a solid effect. The heap no longer fills continuously.
