GC pauses and OOM errors when indexing into an 800 GB cluster

I have a 5-node ES cluster with 32 GB of RAM on each node, and I assign 20 GB to the ES process. These are the relevant fields in my yml:

discovery.type: ec2
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.timeout: 60s  # changed this because 10s was not enough
discovery.zen.minimum_master_nodes: 3
script.disable_dynamic: true
bootstrap.mlockall: true
indices.fielddata.cache.size: 50%
indices.breaker.fielddata.limit: 60%
indices.breaker.request.limit: 40%
indices.breaker.total.limit: 70%
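
To make the percentage-based settings above easier to reason about, here is a minimal sketch that converts them into absolute sizes. It assumes each percentage is taken relative to the 20 GB JVM heap mentioned in the post; the names `HEAP_GB`, `SETTINGS_PCT` and `absolute_limits_gb` are illustrative only and not part of Elasticsearch.

```python
# Heap size as stated in the post (assumption: the limits are percentages of this heap).
HEAP_GB = 20

# Percent-of-heap settings copied from the yml; kept as integers to avoid float rounding.
SETTINGS_PCT = {
    "indices.fielddata.cache.size": 50,
    "indices.breaker.fielddata.limit": 60,
    "indices.breaker.request.limit": 40,
    "indices.breaker.total.limit": 70,
}


def absolute_limits_gb(heap_gb, settings_pct):
    """Convert percent-of-heap settings into absolute sizes in GB."""
    return {name: heap_gb * pct / 100 for name, pct in settings_pct.items()}


limits = absolute_limits_gb(HEAP_GB, SETTINGS_PCT)
for name, gb in limits.items():
    print(f"{name}: {gb:.1f} GB")
```

Under that assumption the fielddata cache may grow to 10 GB and the total breaker trips at 14 GB, both well above the roughly 3 GB of combined cache usage reported in the question.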

Elasticsearch Version: 1.3.1

I index anywhere between 500 and 1000 documents per minute (they are structured like tweets and other social network data). The cluster holds 406 million documents (replicas excluded) and 800 GB of data (replicas included).

Recently I observed the heap growing continuously until GC pauses or OOM errors take nodes down. I suspect this is a problem with indexing rather than querying, because the field data cache and filter cache never exceed 3 GB combined.

This is the current cluster health:
{
"cluster_name": "name_of_cluster",
"status": "yellow",
"timed_out": false,
"number_of_nodes": 5,
"number_of_data_nodes": 5,
"active_primary_shards": 15,
"active_shards": 26,
"relocating_shards": 0,
"initializing_shards": 4,
"unassigned_shards": 0
}
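
The shard counts in the health output above can be cross-checked with a little arithmetic. The sketch below is only a reading of those numbers, assuming every index uses the same replica count; the variable names are illustrative.

```python
import json

# Shard-related fields copied from the cluster health output in the post.
health = json.loads("""
{
  "status": "yellow",
  "active_primary_shards": 15,
  "active_shards": 26,
  "relocating_shards": 0,
  "initializing_shards": 4,
  "unassigned_shards": 0
}
""")

primaries = health["active_primary_shards"]

# Every shard copy is active, initializing, or unassigned.
total_copies = (
    health["active_shards"]
    + health["initializing_shards"]
    + health["unassigned_shards"]
)

# Assumption: a uniform replica count across all indices.
replicas_per_primary = total_copies // primaries - 1

# Active shards include the primaries, so the remainder are active replicas.
active_replicas = health["active_shards"] - primaries
pending_replicas = health["initializing_shards"] + health["unassigned_shards"]

print(f"total shard copies:   {total_copies}")
print(f"replicas per primary: {replicas_per_primary}")
print(f"active replicas:      {active_replicas}")
print(f"pending replicas:     {pending_replicas}")
```

Read this way, the numbers are consistent with one replica per primary (30 copies in total), and the yellow status matches the 4 replica copies that are still initializing, presumably after nodes went down and rejoined.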

I would like to know where I need to make improvements. Should I increase the RAM to 64 GB per node, or something similar? I am also considering using doc_values and upgrading ES to the latest version, but I would like to understand the root cause of this behaviour before taking any action.

This is the hot threads output: https://gist.github.com/naryad/abe852c04dbac5e5611a
This is the output of the node stats API: https://gist.github.com/naryad/06ec0e17c0c02e311e80

The heap fills up slowly with old generation objects, and when GC runs it clears none of them. Old generation objects account for up to 90% of the 20 GB heap allocated to ES.
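
One way to watch old generation usage over time is to poll the node stats API and summarise the old-generation pool per node. The following is a rough sketch, not a definitive tool: the host `http://localhost:9200` is an assumption, the helper names are hypothetical, and the JSON field names (`jvm.mem.pools.old`, `jvm.gc.collectors.old`) reflect my understanding of the 1.x node stats response, so they should be checked against the actual output linked above.

```python
import json
from urllib.request import urlopen


def fetch_jvm_stats(base_url="http://localhost:9200"):
    """Fetch JVM stats for all nodes (base_url is an assumed address)."""
    with urlopen(f"{base_url}/_nodes/stats/jvm") as resp:
        return json.loads(resp.read().decode("utf-8"))


def old_gen_report(stats):
    """Summarise old-generation heap usage and old GC activity per node.

    Assumes the node stats layout nodes -> <id> -> jvm -> mem/gc.
    """
    report = {}
    for node in stats["nodes"].values():
        old_pool = node["jvm"]["mem"]["pools"]["old"]
        old_gc = node["jvm"]["gc"]["collectors"]["old"]
        report[node["name"]] = {
            "old_gen_used_pct": 100 * old_pool["used_in_bytes"] / old_pool["max_in_bytes"],
            "old_gc_count": old_gc["collection_count"],
            "old_gc_time_ms": old_gc["collection_time_in_millis"],
        }
    return report


# Example usage against a live cluster (not executed here):
#   for name, row in old_gen_report(fetch_jvm_stats()).items():
#       print(name, row)
```

If `old_gen_used_pct` stays high while `old_gc_count` keeps rising between polls, old collections are running without reclaiming memory, which is the pattern described above.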

Can you update to a more recent ES version?

I can do that; in fact, it is the next thing I am going to do. Thanks for the reply. I just wanted to know whether I am doing something wrong, or whether I need to upgrade to 64 GB of RAM, increase the number of nodes, or tune any settings.

It looks like the upgrade had a solid effect: the heap no longer fills up continuously.