Nodes slowly climbing to high memory usage

Hi, we are experiencing some issues with our Elasticsearch cluster: after around 24 hours of usage the nodes report more than 95% heap usage and start becoming unresponsive, bringing down the whole cluster; we have to restart them constantly to keep the cluster alive. This is the command line we use to start the Elasticsearch process:

/usr/bin/java -Xms25g -Xmx25g -Djava.awt.headless=true -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -XX:+DisableExplicitGC -Dfile.encoding=UTF-8 -Djna.nosys=true -Des.path.home=/usr/share/elasticsearch -cp /usr/share/elasticsearch/lib/elasticsearch-2.4.1.jar:/usr/share/elasticsearch/lib/* org.elasticsearch.bootstrap.Elasticsearch start -p /var/run/elasticsearch/elasticsearch.pid -d -Des.default.path.home=/usr/share/elasticsearch -Des.default.path.logs=/var/log/elasticsearch -Des.default.path.data=/var/lib/elasticsearch -Des.default.path.conf=/etc/elasticsearch

Another interesting detail: in our CQRS architecture, if we attach a cluster only to the write path the heap issue doesn't seem to occur, but heap usage starts climbing as soon as we attach the nodes to the read path of the system.

This seems to point to a caching issue or to some query that is triggering the behaviour. On the latter, we do have a query to which we provide up to 5000 ids to exclude from the results, much like a WHERE NOT IN (...) SQL query. Is there anything we should do to take care of this specific query?
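Roughly, that query is shaped like the following (the field names and values here are placeholders, and in reality the terms clause carries up to ~5000 ids):

    {
      "query": {
        "bool": {
          "must": [
            { "match": { "title": "some text" } }
          ],
          "must_not": [
            { "terms": { "documentId": ["id-0001", "id-0002", "id-0003"] } }
          ]
        }
      }
    }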

Any advice on what could be causing this issue or how we can keep the heap usage from blowing up the cluster would be really appreciated.

Many thanks!

Could you post node stats for a node that reached 95% of heap?
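Something along these lines will dump them for every node (adjust host and port to your setup):

    # collects JVM, cache, and script stats for all nodes
    curl 'localhost:9200/_nodes/stats?pretty' > node_stats.json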

Sorry for the late answer, I'm attaching all the stats I could find about a faulty node:

Thanks again!

I couldn't see a smoking gun by looking at the stats, but this portion might be interesting to investigate:

         "script":{
            "compilations":387991,
            "cache_evictions":387891
         }

Which scripting language are you using and how do you generate your scripts? Is it possible to move the dynamic portion of your script into parameters instead of generating scripts on the fly?
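For example, something along these lines keeps the script source constant, so it is compiled once and served from the cache, while the varying values travel as params (the field and parameter names here are just placeholders, not taken from your setup):

    {
      "query": {
        "function_score": {
          "query": { "match_all": {} },
          "script_score": {
            "script": {
              "inline": "_score * doc[field].value * factor",
              "lang": "groovy",
              "params": { "field": "SomeDoubleField", "factor": 1.2 }
            }
          }
        }
      }
    }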

Hi Igor,
Thanks for replying.
We use Groovy. We generate the scripts in the application code, alongside the queries, and we pass them to ES inline (as opposed to stored Groovy scripts). The scripts access document fields to compute a couple of values and change the score accordingly.
This is used in two simple cases; it would be non-trivial to refactor and it is used quite a lot.

Do you reckon those get cached and never evicted?

Thanks,
Aris

Do you reckon those get cached and never evicted?

I don't know for sure. To figure out exactly what's leaking memory we would need to take a look at a heap dump. Without it I can only guess and poke at things that don't look quite right (i.e. different from typical installations that don't experience memory leaks). In your case it looks like all these scripts are getting evicted, but you recompile them on every request, which has caused issues before. So I can only assume it's possible that this issue wasn't quite fixed in your particular use case.

Any chance we can take a look at the heap dump?

There is a good chance you are right:
I ran the Eclipse Memory Analyzer on the dump and it reports:
The class "org.codehaus.groovy.reflection.ClassInfo", loaded by "java.net.FactoryURLClassLoader @ 0x1c42d78b8", occupies 1,899,607,464 (43.61%) bytes. The memory is accumulated in class "org.codehaus.groovy.reflection.ClassInfo", loaded by "java.net.FactoryURLClassLoader @ 0x1c42d78b8".
which indicates that it is a prime leak suspect.

Looking at the histogram in jhat:


Class	Instance Count	Total Size
class [C	1064936	1229061570
class org.codehaus.groovy.runtime.metaclass.MetaMethodIndex$Entry	12038167	1011206028
class [B	422619	712977967
class [Ljava.lang.Object;	9556045	387915336
class [Lorg.codehaus.groovy.util.ComplexKeyHashMap$Entry;	968641	263472144
class [Lorg.codehaus.groovy.runtime.metaclass.MetaMethodIndex$Entry;	69207	142814832
class java.util.HashMap$Node	5040441	141132348
class [Ljava.util.HashMap$Node;	548700	130578824
class org.codehaus.groovy.util.FastArray	8716940	104603280
class java.lang.reflect.Method	585909	76168170
class org.codehaus.groovy.util.SingleKeyHashMap$Entry	2283352	63933856
class java.util.HashMap	1268142	60870816
class java.lang.invoke.MethodHandleImpl$CountingWrapper	1115156	60218424

I will take another dump today and share it. The heap increased by 3 GB per node (on average) overnight,
so I guess the diff of the two will reveal additional info.
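For reference, we grab and browse the dumps with something along these lines (the pid and paths are placeholders):

    # dump the live objects of the faulty node's JVM
    jmap -dump:live,format=b,file=/tmp/es-heap.hprof <elasticsearch-pid>

    # serve the dump (object histogram, instance counts, etc.) on a local web UI
    jhat -port 7000 /tmp/es-heap.hprof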

That would be great. Just don't share the entire heap dump publicly since it might contain sensitive information about your cluster.

Hi,

It looks like another Groovy memory leak issue. Can you please share a sample of scripts that are executed? That might help to reproduce the issue.

Thanks

Indeed Groovy was the issue (sorry for the late reply). Removing Groovy fixed it.
The scripts looked like this:

scriptScore(
  s"doc['SomeDoubleField'].value / (1 + Math.log(1 + doc['SomeIntField'].value)) "
)

There were two variations of this, both single-line script scores performing basic math on relatively small result sets.
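In case it helps others, the same kind of arithmetic can in principle be expressed without Groovy using the Lucene expression language; this is only a sketch with the same placeholder field names, not what we actually deployed:

    {
      "query": {
        "function_score": {
          "query": { "match_all": {} },
          "script_score": {
            "script": {
              "inline": "doc['SomeDoubleField'].value / (1 + ln(1 + doc['SomeIntField'].value))",
              "lang": "expression"
            }
          }
        }
      }
    }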
