GC Pauses and OOM errors when indexing into an 800 GB cluster

(naryad) #1

I have a 5-node ES cluster with 32 GB RAM on each node, and I assign 20 GB to the ES process. These are the relevant settings in my yml.

discovery.type: ec2
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.timeout: 60s (increased this, as 10s was not enough)
discovery.zen.minimum_master_nodes: 3
script.disable_dynamic: true
bootstrap.mlockall: true
indices.fielddata.cache.size: 50%
indices.breaker.fielddata.limit: 60%
indices.breaker.request.limit: 40%
indices.breaker.total.limit: 70%
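As a quick sanity check, those percentage-based settings translate into the following absolute budgets under the 20 GB heap from this post (how the fielddata cache and the breakers interact is version-dependent, so treat these as rough upper bounds):

```python
# Approximate absolute budgets implied by the percentage settings,
# assuming the 20 GB heap described above.
heap_gb = 20

settings = {
    "indices.fielddata.cache.size (50%)": 0.50,
    "indices.breaker.fielddata.limit (60%)": 0.60,
    "indices.breaker.request.limit (40%)": 0.40,
    "indices.breaker.total.limit (70%)": 0.70,
}

for name, fraction in settings.items():
    print(f"{name}: {heap_gb * fraction:.0f} GB")
```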

Elasticsearch Version: 1.3.1

I index anywhere between 500 and 1,000 documents per minute (structured much like tweets and other social network data). The cluster holds 406 million documents (excluding replicas) and 800 GB of data (including replicas).

Recently I observed the heap continuously increasing until, in the end, GC pauses or OOM errors take nodes down. I figured this is more of a problem with indexing than with querying, as the field data cache and filter cache never exceed 3 GB combined.

This is the current cluster health:

{
  "cluster_name": "name_of_cluster",
  "status": "yellow",
  "timed_out": false,
  "number_of_nodes": 5,
  "number_of_data_nodes": 5,
  "active_primary_shards": 15,
  "active_shards": 26,
  "relocating_shards": 0,
  "initializing_shards": 4,
  "unassigned_shards": 0
}
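The yellow status is consistent with replicas still initializing rather than anything being lost. Assuming one replica per primary (not stated in the post), the shard counts add up exactly:

```python
# Sanity check on the cluster-health numbers above.
active_primary_shards = 15
active_shards = 26
initializing_shards = 4
replicas_per_primary = 1  # assumption; number_of_replicas is not shown

# Expected shard copies: each primary plus its replicas.
expected_total = active_primary_shards * (1 + replicas_per_primary)
print(expected_total)                       # 30 copies expected
print(active_shards + initializing_shards)  # 30 copies accounted for
```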

I would like to know where I need to make improvements. Should I increase the RAM to 64 GB per node, or pursue similar options? I am also considering using doc_values and upgrading ES to the latest version, but I would like to understand the root cause of this behaviour before taking any action.
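For reference, in ES 1.x doc_values are opted into per field in the mapping and only apply to not_analyzed string fields and numeric/date fields; existing data must be reindexed to benefit. A minimal sketch, with hypothetical type and field names ("tweet", "user_id" are illustrative, not from the post):

```json
{
  "mappings": {
    "tweet": {
      "properties": {
        "user_id": {
          "type": "string",
          "index": "not_analyzed",
          "doc_values": true
        }
      }
    }
  }
}
```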

This is the hot threads output https://gist.github.com/naryad/abe852c04dbac5e5611a
This is the output of node stats API https://gist.github.com/naryad/06ec0e17c0c02e311e80

The heap fills slowly with old-generation objects, and when GC runs, almost none of them are collected. Old-generation objects account for up to 90% of the 20 GB heap allocated to ES.
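The old-generation pressure can be quantified per node from the node stats output. A minimal sketch, assuming the JVM pool structure the 1.x node stats API returns (jvm.mem.pools.old) and using made-up numbers in place of the real response in the gist:

```python
# Compute old-gen heap usage per node from a node-stats-style payload.
# The byte values below are illustrative, not taken from the linked gist.
stats = {
    "nodes": {
        "node_1": {
            "jvm": {
                "mem": {
                    "pools": {
                        "old": {
                            "used_in_bytes": 18 * 1024**3,
                            "max_in_bytes": 20 * 1024**3,
                        }
                    }
                }
            }
        }
    }
}

for node_id, node in stats["nodes"].items():
    old = node["jvm"]["mem"]["pools"]["old"]
    pct = 100.0 * old["used_in_bytes"] / old["max_in_bytes"]
    print(f"{node_id}: old gen at {pct:.0f}% of its pool")  # 90%, matching the symptom
```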

(Jörg Prante) #2

Can you update to a more recent ES version?

(naryad) #3

I can do that; in fact, that is the immediate next thing I am going to do. Thanks for the reply. I just wanted to know whether I am doing something wrong, or whether I need to increase RAM to 64 GB, add more nodes, or tune any settings.

(naryad) #4

Looks like the upgrade has had a solid effect. The heap no longer fills continuously.
