java.lang.OutOfMemoryError after trying to garbage collect for 20 minutes


(Suny Kim) #1

Hi. Our cluster crashed with an OOM yesterday at 13:00, with

java.lang.OutOfMemoryError: Java heap space
at org.elasticsearch.common.compress.BufferRecycler.allocDecodeBuffer(BufferRecycler.java:137)
at org.elasticsearch.common.compress.lzf.LZFCompressedStreamInput.<init>(LZFCompressedStreamInput.java:46)
...

It looks like the node tried desperately to garbage collect from 12:40 on, in both the old and young generations, but without much success. The collection preceding the OOM was:

[2015-11-12 13:00:06,312][WARN ][monitor.jvm ] [xxx] [gc][old][3121308][923] duration [11.6s], collections [1]/[11.6s], total [11.6s]/[10.3m], memory [29.8gb]->[29.8gb]/[29.9gb], all_pools {[young] [819.2mb]->[819.2mb]/[819.2mb]}{[survivor] [82.7mb]->[101.5mb]/[102.3mb]}{[old] [28.9gb]->[29gb]/[29gb]}
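
To make the reading of that log line concrete, here is a minimal Python sketch that extracts the per-pool figures and computes how much each collection freed. It is not an Elasticsearch tool: the helper names `to_bytes` and `pool_usage` are made up for illustration, and I am assuming the `mb`/`gb` suffixes are binary units.

```python
import re

# The monitor.jvm line quoted above, copied verbatim.
LINE = (
    "[2015-11-12 13:00:06,312][WARN ][monitor.jvm ] [xxx] [gc][old][3121308][923] "
    "duration [11.6s], collections [1]/[11.6s], total [11.6s]/[10.3m], "
    "memory [29.8gb]->[29.8gb]/[29.9gb], "
    "all_pools {[young] [819.2mb]->[819.2mb]/[819.2mb]}"
    "{[survivor] [82.7mb]->[101.5mb]/[102.3mb]}"
    "{[old] [28.9gb]->[29gb]/[29gb]}"
)

# Assumption: size suffixes are binary multiples.
UNITS = {"b": 1, "kb": 1024, "mb": 1024 ** 2, "gb": 1024 ** 3}

# Matches one "{[pool] [before]->[after]/[max]}" group.
POOL_RE = re.compile(r"\{\[(\w+)\] \[([^\]]+)\]->\[([^\]]+)\]/\[([^\]]+)\]\}")


def to_bytes(text):
    """Convert a size such as '819.2mb' or '29gb' to a byte count."""
    match = re.fullmatch(r"([\d.]+)(b|kb|mb|gb)", text)
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    return float(match.group(1)) * UNITS[match.group(2)]


def pool_usage(line):
    """Return, per memory pool, the bytes reclaimed and the fill ratio after GC."""
    pools = {}
    for name, before, after, maximum in POOL_RE.findall(line):
        b, a, mx = (to_bytes(v) for v in (before, after, maximum))
        pools[name] = {"reclaimed": b - a, "fill_after": a / mx}
    return pools


if __name__ == "__main__":
    for name, stats in pool_usage(LINE).items():
        print(f"{name}: reclaimed {stats['reclaimed'] / UNITS['mb']:.1f} mb, "
              f"{stats['fill_after']:.1%} full after GC")
```

For this line the old generation ends up completely full and no pool shrinks, which matches the impression that the 11.6 s collection freed nothing.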

elasticsearch version "1.6.0"
java version "1.8.0_45"
-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC
-XX:+DisableExplicitGC
-Xms30g -Xmx30g

We don't have much insight into the situation before the crash, sadly. The cluster is in production and ran fine for many months; it's part of an ELK stack.
Our mapping says "doc_values" : true for nearly everything.

