Heap usage causing node failure - 5.5.2

Running a cluster of 6 nodes on Ubuntu 14.04 with 128 GB RAM each, on Elasticsearch 5.5.2.

The heap is currently set to 28 GB, and I've confirmed ES is using zero-based compressed oops as recommended in https://www.elastic.co/blog/a-heap-of-trouble

We've been having issues with garbage collection taking nodes offline. For some reason, when one node goes down, the whole cluster becomes unresponsive.

These lines appear in the log, and then the node dies:

[2017-08-24T09:43:07,034][INFO ][o.e.m.j.JvmGcMonitorService] [se-prod-logyard5] [gc][247] overhead, spent [325ms] collecting in the last [1s]
[2017-08-24T09:43:08,035][WARN ][o.e.m.j.JvmGcMonitorService] [se-prod-logyard5] [gc][248] overhead, spent [633ms] collecting in the last [1s]
[2017-08-24T09:43:09,035][INFO ][o.e.m.j.JvmGcMonitorService] [se-prod-logyard5] [gc][249] overhead, spent [371ms] collecting in the last [1s]
[2017-08-24T09:43:10,037][INFO ][o.e.m.j.JvmGcMonitorService] [se-prod-logyard5] [gc][250] overhead, spent [460ms] collecting in the last [1s]

Here is heap usage over the past hour. I'm not sure yet whether this indicates the heap being too big, a memory leak, or something we're doing wrong.

I'd appreciate any advice on this.

Without any indication of what the memory is being used for, this is hard to help with. First, are you sure that you need 28 GB of memory for each node? Why not 16 or 8? What is this memory used for? You can use the node stats and node info APIs to find out more.
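
For example, a minimal sketch along these lines pulls per-node heap usage from the node stats API (it assumes a node is reachable on http://localhost:9200 without authentication; adjust the host to your setup):

# Minimal sketch: per-node JVM heap usage via the node stats API.
# Assumes a node reachable on http://localhost:9200 without authentication.
import requests

stats = requests.get("http://localhost:9200/_nodes/stats/jvm").json()
for node_id, node in stats["nodes"].items():
    mem = node["jvm"]["mem"]
    print(node["name"],
          "heap used:", str(mem["heap_used_percent"]) + "%",
          "of", mem["heap_max_in_bytes"], "bytes")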

The point I'm getting at is: if the memory is not really needed, it just fills up over time, and then a huge GC takes a long time because it has a lot of memory to clear out.

If you do need this memory, maybe there are strategies to reduce memory consumption (different mappings, fewer shards, etc.), for example along the lines of the sketch below.
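
As a hypothetical example of the "fewer shards" option, an index template like this caps new time-based indices at fewer primary shards (the template name and the logstash-* pattern are assumptions; note that 5.x uses the template field, which later became index_patterns in 6.x):

# Hypothetical sketch: cap new time-based indices at 2 primary shards via an
# index template (ES 5.x syntax). Template name and index pattern are assumptions.
import requests

template = {
    "template": "logstash-*",
    "settings": {"index": {"number_of_shards": 2, "number_of_replicas": 1}},
}
requests.put("http://localhost:9200/_template/fewer_shards", json=template)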

So describing your general setup would be a good idea, so other people get more context.

I'm not sure we need 28 GB per node at this point; I've actually reduced it to 18 GB and it's better, but we're still experiencing issues. I'll get some more insight into memory usage in a bit.

Just had a crash on one of our nodes and got this message after some long garbage collection times:

Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [request] Data too large, data for [<reused_arrays>] would be [16605848256/15.4gb], which is larger than the limit of [16001453260/14.9gb]
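
For what it's worth, the state of the request breaker that tripped here can be checked per node; a minimal sketch, again assuming a node on http://localhost:9200 without authentication:

# Minimal sketch: inspect the request circuit breaker on each node.
# Assumes a node reachable on http://localhost:9200 without authentication.
import requests

stats = requests.get("http://localhost:9200/_nodes/stats/breaker").json()
for node_id, node in stats["nodes"].items():
    req = node["breakers"]["request"]
    print(node["name"],
          "limit:", req["limit_size"],
          "estimated:", req["estimated_size"],
          "tripped:", req["tripped"])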

This was caused by a visualization that had a terms bucket with its size set to 500,000,000, so I think this can be closed...
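
For reference, the offending request boiled down to a terms aggregation shaped roughly like the sketch below (the index and field names are made up for illustration); keeping size in the tens or hundreds avoids reserving the huge arrays that tripped the breaker:

# Hypothetical reconstruction of the culprit: a terms aggregation with an
# enormous "size" can make Elasticsearch reserve very large arrays and trip
# the request circuit breaker. Index and field names are made up.
import requests

query = {
    "size": 0,
    "aggs": {
        "top_hosts": {
            "terms": {
                "field": "host.keyword",
                "size": 500000000,  # the problematic value; something like 10-1000 is sane
            }
        }
    },
}
requests.post("http://localhost:9200/logstash-*/_search", json=query)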
