Hi, we are experiencing some issues with our Elasticsearch cluster: after around 24 hours of usage the nodes report more than 95% heap usage and start becoming unresponsive, bringing down the whole cluster; we need to restart them constantly to keep it alive. This is the command line we use to start the Elasticsearch process:
Another interesting detail is that in our CQRS architecture, if we attach a cluster only to the write path the heap issue doesn't seem to occur, but heap usage starts climbing as soon as we attach the nodes to the read path of the system.
This seems to point to a caching issue or to some query that is triggering the behaviour. On the latter, we do have a query to which we provide up to 5000 ids to exclude from the results, as in a SQL WHERE NOT IN (...) clause; a simplified sketch is below. Is there anything we should do to take care of this specific query?
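The sketch (in Python over the REST API; the endpoint, index name and ids are placeholders, not our real ones) looks roughly like this, with the exclusion expressed as a bool query whose must_not clause lists the ids:

import requests

excluded_ids = ["id-1", "id-2", "id-3"]  # in production this list can grow to ~5000 ids

query = {
    "query": {
        "bool": {
            "must": {"match_all": {}},  # placeholder for the real search clauses
            "must_not": {"ids": {"values": excluded_ids}},  # SQL-style NOT IN (...)
        }
    }
}

resp = requests.post("http://localhost:9200/my-index/_search", json=query)
print(resp.json()["hits"]["total"])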
Any advice on what could be causing this issue or how we can keep the heap usage from blowing up the cluster would be really appreciated.
Which scripting language are you using and how do you generate your scripts? Is it possible to move the dynamic portion of your script into parameters instead of generating scripts on the fly?
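For example, something along these lines (a hypothetical Python sketch against the REST API; the field name, factor and endpoint are made up, not taken from your setup):

import requests

boost = 2.5  # the value that changes per request

# What to avoid: baking the value into the script source, since every distinct
# source string has to be compiled into a new script class.
script_with_value_inlined = "_score * doc['popularity'].value * %s" % boost

# Preferred: a constant script source, with the per-request value passed via params,
# so the compiled script can be reused across requests.
query = {
    "query": {
        "function_score": {
            "query": {"match_all": {}},
            "script_score": {
                "lang": "groovy",
                "script": "_score * doc['popularity'].value * factor",
                "params": {"factor": boost},
            },
        }
    }
}

resp = requests.post("http://localhost:9200/my-index/_search", json=query)
print(resp.json()["hits"]["max_score"])

With the value in params the script source stays identical across requests, so the compiled script can be reused from the cache instead of piling up new Groovy classes.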
Hi Igor,
Thanks for replying.
We use Groovy. We generate the scripts in the application code, alongside the queries, and pass them to ES (as opposed to using stored Groovy scripts). The scripts access document fields to compute a couple of values and change the score accordingly.
This is used in two simple cases; it would be non-trivial to refactor, and it is used quite a lot.
I don't know for sure. To figure out exactly what's leaking memory we would need to take a look at a heap dump. Without it I can only guess and poke at things that don't look quite right (i.e. different from typical installations that don't experience memory leaks). In your case it looks like all these scripts are getting evicted, but you recompile them on every request, which has caused issues before. So I assume it's possible that this issue wasn't quite fixed for your particular use case.
There is a good chance you are right:
I ran the Eclipse Memory Analyzer (MAT) on the dump, and it reports:

The class "org.codehaus.groovy.reflection.ClassInfo", loaded by "java.net.FactoryURLClassLoader @ 0x1c42d78b8", occupies 1,899,607,464 (43.61%) bytes. The memory is accumulated in class "org.codehaus.groovy.reflection.ClassInfo", loaded by "java.net.FactoryURLClassLoader @ 0x1c42d78b8".

so it is flagged as the prime leak suspect.
Looking at the histogram in jhat:
Class | Instance count | Total size (bytes)
[C | 1064936 | 1229061570
org.codehaus.groovy.runtime.metaclass.MetaMethodIndex$Entry | 12038167 | 1011206028
[B | 422619 | 712977967
[Ljava.lang.Object; | 9556045 | 387915336
[Lorg.codehaus.groovy.util.ComplexKeyHashMap$Entry; | 968641 | 263472144
[Lorg.codehaus.groovy.runtime.metaclass.MetaMethodIndex$Entry; | 69207 | 142814832
java.util.HashMap$Node | 5040441 | 141132348
[Ljava.util.HashMap$Node; | 548700 | 130578824
org.codehaus.groovy.util.FastArray | 8716940 | 104603280
java.lang.reflect.Method | 585909 | 76168170
org.codehaus.groovy.util.SingleKeyHashMap$Entry | 2283352 | 63933856
java.util.HashMap | 1268142 | 60870816
java.lang.invoke.MethodHandleImpl$CountingWrapper | 1115156 | 60218424
I will take another dump today and share it. The heap increased by about 3 GB per node (on average) overnight, so I guess the diff between the two dumps will reveal additional info.