Heap usage causing node failure - 5.5.2


(Erik Anderson) #1

Running a cluster of 6 nodes on Ubuntu 14.04 with 128G RAM each, on 5.5.2.

Heap is currently set to 28G, and I've confirmed ES is using zero-based compressed oops as recommended in https://www.elastic.co/blog/a-heap-of-trouble

We've been having issues with garbage collection taking nodes offline. For some reason, when one node goes down, the whole cluster becomes unresponsive.

These lines appear in the log and then the node dies:

[2017-08-24T09:43:07,034][INFO ][o.e.m.j.JvmGcMonitorService] [se-prod-logyard5] [gc][247] overhead, spent [325ms] collecting in the last [1s]
[2017-08-24T09:43:08,035][WARN ][o.e.m.j.JvmGcMonitorService] [se-prod-logyard5] [gc][248] overhead, spent [633ms] collecting in the last [1s]
[2017-08-24T09:43:09,035][INFO ][o.e.m.j.JvmGcMonitorService] [se-prod-logyard5] [gc][249] overhead, spent [371ms] collecting in the last [1s]
[2017-08-24T09:43:10,037][INFO ][o.e.m.j.JvmGcMonitorService] [se-prod-logyard5] [gc][250] overhead, spent [460ms] collecting in the last [1s]
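As a rough sanity check, the overhead numbers in those lines can be summed to see what fraction of wall-clock time the node spends collecting. A minimal sketch, with the regex inferred from the four lines above rather than from the full JvmGcMonitorService format:

```python
import re

# Matches the "overhead, spent [Xms] collecting in the last [Ys]" part of the
# lines above; pattern inferred from this thread, not from the ES source.
GC_LINE = re.compile(r"spent \[(\d+)ms\] collecting in the last \[(\d+)s\]")

def gc_overhead(log_lines):
    """Return the fraction of wall-clock time spent in GC across the lines."""
    spent_ms = 0
    window_ms = 0
    for line in log_lines:
        m = GC_LINE.search(line)
        if m:
            spent_ms += int(m.group(1))
            window_ms += int(m.group(2)) * 1000
    return spent_ms / window_ms if window_ms else 0.0

lines = [
    "[gc][247] overhead, spent [325ms] collecting in the last [1s]",
    "[gc][248] overhead, spent [633ms] collecting in the last [1s]",
    "[gc][249] overhead, spent [371ms] collecting in the last [1s]",
    "[gc][250] overhead, spent [460ms] collecting in the last [1s]",
]
print(round(gc_overhead(lines), 3))  # → 0.447, i.e. ~45% of time in GC
```

Anything sustained above a few percent is a sign the heap is under real pressure rather than just doing routine collections.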

Here is heap usage over the past hour. Not sure yet whether this indicates the heap being too big, a memory leak, or something we're doing wrong.

Appreciate any advice on this.


(Alexander Reelsen) #2

Without any indication of what the memory is being spent on, this is hard to help with. First, are you sure that you need 28 GB of memory on each node? Why not 16 or 8? What is this memory used for? You can use the node stats and node info APIs to find out more.

The point of my question is: if the memory is not really needed, it simply fills up over time, and then a huge GC takes a long time precisely because it can clear out a lot of memory.

If you do need this memory, there may be strategies to reduce memory consumption (different mappings, fewer shards, etc.).

So describing your general setup would be a good idea, so other people get more context.
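To see where the heap goes, the jvm and indices sections of the node stats response are the first place to look. A minimal sketch that summarizes them; the payload below is a trimmed, made-up example shaped like a 5.x `GET _nodes/stats/jvm,indices` response, with only the fields inspected here:

```python
import json

# Hypothetical, trimmed node-stats payload; all numbers are made up.
SAMPLE = json.loads("""
{
  "nodes": {
    "abc123": {
      "name": "se-prod-logyard5",
      "jvm": {"mem": {"heap_used_in_bytes": 21474836480,
                      "heap_max_in_bytes": 30064771072}},
      "indices": {"fielddata": {"memory_size_in_bytes": 1073741824},
                  "segments": {"memory_in_bytes": 5368709120}}
    }
  }
}
""")

def heap_report(stats):
    """Print heap usage plus two big in-heap consumers for each node."""
    for node in stats["nodes"].values():
        mem = node["jvm"]["mem"]
        used, mx = mem["heap_used_in_bytes"], mem["heap_max_in_bytes"]
        idx = node["indices"]
        print(f"{node['name']}: heap {used / mx:.0%} "
              f"(fielddata {idx['fielddata']['memory_size_in_bytes'] >> 20} MB, "
              f"segments {idx['segments']['memory_in_bytes'] >> 20} MB)")

heap_report(SAMPLE)
```

If segment memory dominates, mappings and shard counts are the lever; if fielddata dominates, look at which text fields aggregations are hitting.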


(Erik Anderson) #3

Not sure if we need 28G per node at this point; I actually reduced it to 18G and it's better, but we're still experiencing issues. I'll get some more insight into memory usage in a bit.

Just had a crash on one of our nodes, and got this message after some long garbage collection times:

Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [request] Data too large, data for [<reused_arrays>] would be [16605848256/15.4gb], which is larger than the limit of [16001453260/14.9gb]
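The numbers in that exception line up with the `[request]` circuit breaker, which in 5.x defaults to 60% of the heap (`indices.breaker.request.limit`). A quick check of the arithmetic, assuming that default:

```python
GIB = 1024 ** 3

requested = 16_605_848_256  # "would be [15.4gb]" from the exception
limit = 16_001_453_260      # "limit of [14.9gb]"

print(f"requested: {requested / GIB:.2f} GiB")
print(f"limit:     {limit / GIB:.2f} GiB")
print(f"over by:   {(requested - limit) / (1024 ** 2):.0f} MiB")

# Assuming the 5.x default request-breaker limit of 60% of heap,
# working backwards gives the heap size the breaker was computed from:
print(f"implied heap at 60%: {limit / 0.6 / GIB:.1f} GiB")
```

The breaker did its job here: it rejected one oversized allocation instead of letting the request take the whole node down with an OutOfMemoryError.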

(Erik Anderson) #4

This was caused by a visualization that had a terms bucket with its size set to 500,000,000, so I think this can be closed...
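For reference, the offending aggregation would have looked something like the sketch below (the index-less request body and the field name are hypothetical). A huge `size` makes ES reserve bucket-tracking arrays proportional to it on each shard, which is consistent with the `<reused_arrays>` label in the breaker message above; capping `size` to what you actually display avoids the trip:

```python
import json

# Hypothetical terms aggregations; the field name is made up.
bad_agg = {"terms": {"field": "host.keyword", "size": 500_000_000}}  # trips breaker
good_agg = {"terms": {"field": "host.keyword", "size": 50}}          # top 50 only

body = {"size": 0, "aggs": {"top_hosts": good_agg}}
print(json.dumps(body, indent=2))
```

In Kibana this is the "Size" field on the terms bucket of the visualization; it rarely needs to be more than a few hundred.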


(system) #5

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.