Elasticsearch 6.6.2 constantly failing with Out Of Memory Errors

I have a 7-node cluster, with 4 data nodes and 3 master-eligible nodes.
Each data node has 64GB RAM (30GB configured as heap), 8 CPUs and a 2TB SSD disk.

All 4 data nodes keep going down with an out-of-memory error within a few hours of being restarted.
I reduced my ingestion load from 20k docs/second to 15k and then 10k docs/second, but it didn't help much. My search rate is 0, apart from the Kibana monitoring dashboard refreshing every 30s to show the state of the cluster.

Following are my cluster stats
{ "_nodes" : { "total" : 7, "successful" : 7, "failed" : 0 }, "cluster_name" : "mtr-ng1", "cluster_uuid" : "C9jjlhCmR9yeMwukeQyZZA", "timestamp" : 1553487063305, "status" : "green", "indices" : { "count" : 196, "shards" : { "total" : 1534, "primaries" : 767, "replication" : 1.0, "index" : { "shards" : { "min" : 2, "max" : 10, "avg" : 7.826530612244898 }, "primaries" : { "min" : 1, "max" : 5, "avg" : 3.913265306122449 }, "replication" : { "min" : 1.0, "max" : 1.0, "avg" : 1.0 } } }, "docs" : { "count" : 9609557961, "deleted" : 47530194 }, "store" : { "size_in_bytes" : 2525518088368 }, "fielddata" : { "memory_size_in_bytes" : 135728, "evictions" : 0 }, "query_cache" : { "memory_size_in_bytes" : 4884394, "total_count" : 120970, "hit_count" : 14068, "miss_count" : 106902, "cache_size" : 595, "cache_count" : 651, "evictions" : 56 }, "completion" : { "size_in_bytes" : 0 }, "segments" : { "count" : 20164, "memory_in_bytes" : 6746448528, "terms_memory_in_bytes" : 4794965934, "stored_fields_memory_in_bytes" : 1225705688, "term_vectors_memory_in_bytes" : 0, "norms_memory_in_bytes" : 1583616, "points_memory_in_bytes" : 667648162, "doc_values_memory_in_bytes" : 56545128, "index_writer_memory_in_bytes" : 69417260, "version_map_memory_in_bytes" : 636406, "fixed_bit_set_memory_in_bytes" : 159360, "max_unsafe_auto_id_timestamp" : 1553485454779, "file_sizes" : { } } }, "nodes" : { "count" : { "total" : 7, "data" : 4, "coordinating_only" : 0, "master" : 3, "ingest" : 3 }, "versions" : [ "6.6.2" ], "os" : { "available_processors" : 72, "allocated_processors" : 72, "names" : [ { "name" : "Linux", "count" : 7 } ], "pretty_names" : [ { "pretty_name" : "CentOS Linux 7 (Core)", "count" : 7 } ], "mem" : { "total_in_bytes" : 424374198272, "free_in_bytes" : 82914578432, "used_in_bytes" : 341459619840, "free_percent" : 20, "used_percent" : 80 } }, "process" : { "cpu" : { "percent" : 0 }, "open_file_descriptors" : { "min" : 423, "max" : 3719, "avg" : 2171 } }, "jvm" : { "max_uptime_in_millis" : 320148670, "versions" : [ { "version" : "1.8.0_65", "vm_name" : "Java HotSpot(TM) 64-Bit Server VM", "vm_version" : "25.65-b01", "vm_vendor" : "Oracle Corporation", "count" : 7 } ], "mem" : { "heap_used_in_bytes" : 20870601960, "heap_max_in_bytes" : 218467926016 }, "threads" : 661 }, "fs" : { "total_in_bytes" : 7685746163712, "free_in_bytes" : 4969893056512, "available_in_bytes" : 4589853052928 }, "plugins" : [ ], "network_types" : { "transport_types" : { "security4" : 7 }, "http_types" : { "security4" : 7 } } } }

Following are snapshots of the Kibana monitoring dashboard from the time the problem is happening.

The spike in heap usage, and the fact that it affects all the data nodes at the same time, suggests a sudden shift in workload. Is there an expensive search running at that time?

What do GET _nodes/hot_threads?threads=9999 and GET _tasks?detailed say about the work that the cluster is doing when its heap is so high?

Hi @DavidTurner,
There are no searches running at the time, and the issue is consistently reproducible. The heap fills up to 75% while indexing is going on; at 75% GC kicks in and memory consumption comes down. This repeats once or twice, and then on the third or fourth cycle GC times out and memory consumption doesn't come down, so the heap stays above 95% for a while and eventually the node becomes unresponsive.

Is there a way I can see what is actually filling up my memory? For example, if my 30GB heap is full, I want to know which component is taking what percentage of that heap.

Both hot threads and the detailed tasks output say that bulk ingestion is the only thing happening in the cluster.

What do you mean by "GC times out"?

The pattern of increasing heap usage followed by a drop is normal. The graph above indicates that the second GC was successful (the heap usage seems to drop back down to 4.7GB) but then there's a spike shortly after.

If we need to go deeper than the API calls I suggested above then we will need to look at a heap dump.

What type of data are you indexing? Are you using nested documents and/or parent child relations? Are you performing updates or just indexing new immutable documents?

Hi @Christian_Dahlqvist,
No nested documents or parent-child relations, and no updates; I am just indexing new immutable documents.

@DavidTurner,
The reason I said that GC times out is that I can see it in the Elasticsearch logs:

[2019-03-23T16:45:40,916][WARN ][o.e.m.j.JvmGcMonitorService] [es1.ng2] [gc][old][2531][52] duration [45.9s], collections [1]/[46s], total [45.9s]/[44.1m], memory [29.9gb]->[29.9gb]/[29.9gb], all_pools {[young] [532.5mb]->[532.5mb]/[532.5mb]}{[survivor] [60.1mb]->[62.3mb]/[66.5mb]}{[old] [29.3gb]->[29.3gb]/[29.3gb]}
[2019-03-23T16:45:40,916][WARN ][o.e.m.j.JvmGcMonitorService] [es1.ng2] [gc][2531] overhead, spent [45.9s] collecting in the last [46s]
[2019-03-23T16:53:58,529][ERROR][o.e.x.m.c.n.NodeStatsCollector] [es1.ng2] collector [node_stats] timed out when collecting data
[2019-03-23T17:31:45,405][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [es1.ng2] fatal error in thread [Connection evictor], exiting
java.lang.OutOfMemoryError: Java heap space
at org.apache.http.pool.AbstractConnPool.closeExpired(AbstractConnPool.java:559) ~[?:?]
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.closeExpiredConnections(PoolingHttpClientConnectionManager.java:409) ~[?:?]
at org.apache.http.impl.client.IdleConnectionEvictor$1.run(IdleConnectionEvictor.java:67) ~[?:?]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_65]

If you hit an OutOfMemoryError then, by default, Elasticsearch will have written a heap dump. The best thing to do would be to look at this to see what was consuming all the heap.

I got the heap dump after the above error. Is there a way I can analyze the heap dump to figure out what is taking up memory?

Are there any commands in Elasticsearch that can tell me what is taking up my 4 * 30GB of memory?

Are there any settings in Elasticsearch that I could tune to throttle memory consumption, so I never have to worry about Elasticsearch failing with an out-of-memory error?

I am OK with my queries and ingestion tasks taking more time; my major concern is keeping Elasticsearch up and running stably all the time.

You can use MAT, or else you can send it to me using this link and I will take a look.

The heap dump is the best bet here.

Hi @DavidTurner,

I believe I have figured out the issue here.
The following query, which I need to run once an hour, is filling up my memory and bringing the cluster down:

{ "aggs" : { "requests" : { "terms" : { "field" : "reqtype.keyword" }, "aggs": { "clouds": { "terms": { "field": "cloudname" }, "aggs": { "orgs" : { "terms" : { "field" : "companyid", "size" : 10000 }, "aggs": { "destinations": { "terms" : { "field" : "destip", "size" : 500000 }, "aggs": { "locationids" : { "terms" : { "field" : "locationid" }, "aggs": { "locations" : { "terms" : { "field" : "location" }, "aggs" : { "active_dc" : { "terms" : { "field" : "activedc" } }, "active_latency" : { "max" : { "field" : "avg", "script" : "doc[\"dc\"].value == doc[\"activedc\"].value ? doc[\"avg\"].value : -1" } }, "optimal_dc" : { "terms": { "field" : "dc", "size" : 1, "order" : { "latency" : "asc" } }, "aggs" : { "latency" : { "max" : { "field" : "avg" } } } } } } } } } } } } } } } } }, "size": 0, "query": { "bool": { "must": [ { "query_string": { "query": "isLatency : true", "analyze_wildcard": "true", "default_field": "*" } }, { "range": { "time": { "gte": prevtime * 1000, "lte": curtime * 1000, "format": "epoch_millis" } } } ], "filter": [], "should": [], "must_not": [] } } }

Is there a way I can constrain the resource utilization of this query?
It's fine if this query takes an hour to finish.

You have some large size parameters specified in this deeply nested aggregation, and this will use a lot of heap. I would recommend trying to decrease these or find a different and more efficient way to run this.
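For example, a minimal sketch of just the two largest levels with much smaller sizes (the values here are purely illustrative; what is reasonable depends on how many companies and destinations you actually need per bucket):

"orgs": {
  "terms": { "field": "companyid", "size": 100 },
  "aggs": {
    "destinations": {
      "terms": { "field": "destip", "size": 1000 }
    }
  }
}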

Hi @Christian_Dahlqvist,

I need to run this query to get value out of my data. Is there any way you can think of to make it more efficient?
I have already made all the fields that are part of the query keywords.

Also, my major concern is constraining resources.
Is there a way I can make sure that, irrespective of how complex the nested aggregations are, ES will either take hours to finish or time out after an hour with an error saying the query is beyond the capacity of its resources, instead of becoming unresponsive and going down?

Then you should reduce the depth of nesting in this aggregation, and reduce the sizes to something more reasonable. Do you really need the top 500000 destip values? Do you really need to do this for the top 10000 companyids?

It looks like you're trying to find instances of high latency - is it not sufficient to look at the top few? I cannot really see how the result of this query could be useful to you without further post-processing, because it will be too large. You need to refine your query to one that gives you the information you are actually looking for.


Might you be able to split it up into a number of smaller aggregations by using composite aggregations, while at the same time reducing the size parameters to a lower value that still captures what you want?
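Something along these lines, as a rough sketch (the index pattern and page size are illustrative, and only one of your metric sub-aggregations is shown; each follow-up request would pass the returned after_key back in as the after parameter to fetch the next page):

GET mtr-raw-*/_search
{
  "size": 0,
  "aggs": {
    "company_dest": {
      "composite": {
        "size": 1000,
        "sources": [
          { "companyid": { "terms": { "field": "companyid" } } },
          { "destip": { "terms": { "field": "destip" } } }
        ]
      },
      "aggs": {
        "active_latency": {
          "max": {
            "field": "avg",
            "script": "doc[\"dc\"].value == doc[\"activedc\"].value ? doc[\"avg\"].value : -1"
          }
        }
      }
    }
  }
}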

Hi @DavidTurner,
Basically, I am running this query on my raw mtr data index and re-ingesting the result into an Elasticsearch index, mtr suggestions, which acts as my actual database, where I can search for any company or destip and compare their active latency with their optimal latency.

If I reduce the companyid or destip sizes, I may end up losing the information for those particular companies or destinations in my suggestions index.

Hi @Christian_Dahlqvist,

Composite aggregations seem to be the way to go for my use case; I will try them out.

But is there really no way to cap the resources used by searches on the ES cluster? I don't want some terms aggregation to bring the cluster down. ES can reject the query, time it out, or take a huge amount of time to finish, but becoming unresponsive disturbs the whole pipeline.

Is there no option other than increasing the heap size of the cluster if I ever want to run this query (if, even after optimization, it still brings the cluster down with an OOM)?

One additional thing you may want to look at is the rollup API. I am not sure how well it handles fields with high cardinality, but it might be worth trying. Maybe someone else with more practical experience can shed some light.
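For reference, a rollup job in 6.x is configured roughly like the sketch below (job name, index patterns, cron schedule, fields and interval are all illustrative, and grouping on a very high-cardinality field such as destip is exactly the part I am unsure about):

PUT _xpack/rollup/job/mtr_hourly
{
  "index_pattern": "mtr-*",
  "rollup_index": "mtr_rollup",
  "cron": "0 0 * * * ?",
  "page_size": 1000,
  "groups": {
    "date_histogram": { "field": "time", "interval": "1h" },
    "terms": { "fields": [ "companyid", "destip", "dc", "activedc" ] }
  },
  "metrics": [
    { "field": "avg", "metrics": [ "max" ] }
  ]
}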

Hi @Christian_Dahlqvist,

The rollup API is indeed a very useful feature and has reduced a great deal of the work I used to do with scripts. I am already rolling up my indices using an index management policy.

My query runs only on the current day's and previous day's indices. The total size of those two indices comes to 30GB, or 60GB with 1 replica, whereas my total heap size is around 120GB.

Can you direct me to a page or some other material I can read to understand why a query over a 30GB dataset is pulling down a cluster with 120GB of heap?