Hi. Our cluster crashed with an OOM yesterday at 13:00, with
java.lang.OutOfMemoryError: Java heap space
at org.elasticsearch.common.compress.BufferRecycler.allocDecodeBuffer(BufferRecycler.java:137)
at org.elasticsearch.common.compress.lzf.LZFCompressedStreamInput.<init>(LZFCompressedStreamInput.java:46)
...
It looks like the node desperately tried to garbage-collect memory from 12:40 onward, in both the old and young generations, but without much success. The collection immediately preceding the OOM was:
[2015-11-12 13:00:06,312][WARN ][monitor.jvm ] [xxx] [gc][old][3121308][923] duration [11.6s], collections [1]/[11.6s], total [11.6s]/[10.3m], memory [29.8gb]->[29.8gb]/[29.9gb], all_pools {[young] [819.2mb]->[819.2mb]/[819.2mb]}{[survivor] [82.7mb]->[101.5mb]/[102.3mb]}{[old] [28.9gb]->[29gb]/[29gb]}
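For reference, here is roughly how we are polling heap and fielddata pressure now, so we can catch this earlier next time; a minimal sketch assuming the default HTTP port 9200:

# Per-node JVM heap usage and GC counters (nodes stats API)
curl -s 'localhost:9200/_nodes/stats/jvm?pretty'

# Fielddata currently held on the heap per node (cat API),
# to check whether anything is still bypassing doc_values
curl -s 'localhost:9200/_cat/fielddata?v'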
elasticsearch version "1.6.0"
java version "1.8.0_45"
-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC
-XX:+DisableExplicitGC
-Xms30g -Xmx30g
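These flags were copied out of our startup config by hand; as a sanity check, the heap limits actually in effect on a live node can be read back via the nodes info API (again assuming port 9200):

# The jvm section of nodes info reports the heap_init / heap_max in effect
curl -s 'localhost:9200/_nodes/jvm?pretty'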
We don't have much insight into the situation before the crash, sadly. The cluster is in production and ran fine for many months; it's part of an ELK stack.
Our mapping sets "doc_values" : true for nearly everything.
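To make that concrete, here is a minimal sketch of the kind of template we use; the template and field names are made up for illustration, not our real mapping:

# Hypothetical index template: doc_values enabled wherever the field type allows it
curl -XPUT 'localhost:9200/_template/logs_docvalues_example' -d '{
  "template": "logstash-*",
  "mappings": {
    "_default_": {
      "properties": {
        "status":     { "type": "string", "index": "not_analyzed", "doc_values": true },
        "bytes":      { "type": "long", "doc_values": true },
        "@timestamp": { "type": "date", "doc_values": true }
      }
    }
  }
}'

As far as we understand, doc_values only apply to not_analyzed strings and to numeric/date fields, so any analyzed string fields would still build fielddata on the heap.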