Off-heap memory leak?

I keep observing an Elasticsearch-on-Docker instance that grabs more and more off-heap memory until Mesos kills the Docker container.

Since no data is going in apart from the .monitor index, and searches are pretty quiet on this ES instance, this is really peculiar. Note: the heap memory stays constant at just under 1 GB.

Any ideas?

Thanks

--John

Hi,

wild guess: The ergonomic choice of the JVM for the maximum allowed direct memory is too high.

Unfortunately, it is a bit hard to find out which value the JVM has chosen (you can write a small Java program which reads sun.misc.VM.maxDirectMemory() and invoke it in the Docker container).
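
For reference, a minimal sketch of such a program might look like this (assuming a JDK 8-style runtime where the internal sun.misc.VM class is still accessible; the class name is just a placeholder):

    // Prints the JVM's effective maximum direct memory, in MB.
    // Note: sun.misc.VM is an internal API; javac will warn about it, and newer JDKs may not expose it.
    public class MaxDirectMemory {
        public static void main(String[] args) {
            long maxBytes = sun.misc.VM.maxDirectMemory();
            System.out.println("maxDirectMemory = " + (maxBytes / (1024 * 1024)) + " MB");
        }
    }

Compile and run it inside the container with the same JVM options as Elasticsearch so that the ergonomic defaults match.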

You can restrict direct memory explicitly e.g. to 2G with -XX:MaxDirectMemory=2G. Be aware that you will see more frequent garbage collections if the value you choose is too small. Also, ensure that -XX:+DisableExplicitGC is not set in your jvm.options (it was set by default before Elasticsearch 5.5.2).

Daniel

Hi Daniel,

Thanks for your response! Um, I believe you mean -XX:MaxDirectMemorySize=2G. :slight_smile: I tried that, and it had no effect; off-heap memory usage kept growing until Mesos killed the Docker container.

--John

Hi John,

yes, I meant that parameter. Did you check what direct memory size the JVM has chosen for your container? I don't know anything about your environment, and 2G might be the wrong choice as well; it was just an example of how to set it.

If it is really not direct memory, I'd start digging into whether native memory tracking reveals something.
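
For reference, native memory tracking has to be switched on at JVM startup; a minimal sketch, assuming the standard HotSpot flag and that Elasticsearch runs as PID 1 inside the container. Add to jvm.options (summary has lower overhead, detail also records allocation call sites):

    -XX:NativeMemoryTracking=detail

Then query the running JVM:

    jcmd 1 VM.native_memory baseline
    jcmd 1 VM.native_memory detail.diff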

Daniel

Hi Daniel,

Cool, and I hope my reply did not seem snarky. There are likely JVM params I am unaware of, so I just wanted to make sure I am trying out exactly what you are suggesting.

Definitely a great suggestion re: native memory tracking. I turned that on and restarted the container; monitoring now.

Thanks, and thanks again for the follow-up!

--John

Also, I did confirm with ps -ef that (1) the only processes running in my container are ES and the ES plugin controller, and (2) the ES process is the one gobbling up more and more off-heap memory.

Hi John,

no, not at all. I just wanted to double-check if you might have set it to a value that is still too high. :slight_smile:

That's not necessarily a disadvantage. Setting some exotic JVM parameter might bite you even years down the road; see our blog for an example. I think it pays off to be conservative and adjust the defaults only when really necessary.

I hope that reveals something. I know that finding the root cause can be time-consuming. But often you learn interesting stuff about the JVM or the kernel. :slight_smile:

Daniel

Very interesting blog post, thanks for sharing that! BTW, regarding jvm.options: please share your recommended settings (at least the ones you start out with). It's understood that the Xms and Xmx settings will be different.

Thanks

--John

Interesting results. I did a jcmd 1 VM.native_memory baseline to establish a baseline and I have a script executing jcmd 1 VM.native_memory detail.diff and writing it to a file every 2 minutes.
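
For reference, a minimal bash sketch of that kind of loop (the log path is just a placeholder):

    #!/bin/bash
    # Append an NMT diff (relative to the earlier baseline) every 2 minutes.
    while true; do
      date >> /tmp/nmt-diff.log
      jcmd 1 VM.native_memory detail.diff >> /tmp/nmt-diff.log
      sleep 120
    done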

The growth in native memory of the ES Java process matches the increase in overall ES Docker container memory usage per docker stats.

The main drivers are as follows:

Internal (reserved=356865KB +51292KB, committed=356861KB +51292KB)
(malloc=356829KB +51292KB #25812 +390)

Thread (reserved=260950KB +39298KB, committed=260950KB +39298KB)
(stack: reserved=259284KB +39064KB, committed=259284KB +39064KB)

After monitoring for 1 hour-ish, it appears that the Thread stack memory is increasing way faster than Internal.

A couple of things:

  1. -Xss: I have this set to 1 MB (-Xss1m).
  2. Activity: this is a one-node cluster that has very few reads and even fewer writes.

I wonder if I have long-lived threads that are not being cleaned up due to a dearth of activity, and whether setting the thread stack size to a relatively high value accounts for more and more off-heap memory being used. I further wonder if removing -Xss1m would make a difference. I am gonna try that and see what happens.
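
As a rough sanity check on that hunch, using the NMT numbers above and assuming roughly 1 MB per stack: 259,284 KB reserved for thread stacks / ~1,024 KB per stack ≈ 253 threads' worth of stack space, so the stack figure really does scale directly with the number of live threads.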

Definitely spit-balling here, but hey, what's the worst that can happen? :rofl:

--John

I would not recommend reducing the thread stack size because you risk running into a StackOverflowError. Instead, I'd recommend that you reduce the number of threads. To be clear: this can reduce performance, but my impression is that you are more worried about memory usage than performance.

Also, you should track memory usage over a longer period of time than just one hour. The application needs time to warm up properly.

Hi Daniel,

Cool, thanks, that makes sense re: thread stack size. My testing on that did not go well, as the off-heap memory usage actually appeared to accelerate.

A co-worker of mine had his thread_pool.generic.keep_alive set to 30s. I set that, kept the default thread stack size by removing -Xss1m, and let this run overnight. Interesting: whereas the total native memory usage went up 521 MB, the Thread stack portion only went up 197 MB. It appears setting keep_alive to 30s helped, and this appears to point to thread stacks gobbling up off-heap memory.

I restarted with thread_pool.generic.max set to 100 and I am monitoring.

--John

Hi Daniel,

Just to recap, here's where we're at configuration-wise:

thread_pool.generic.keep_alive: 30s
thread_pool.generic.max: 100
thread_pool.index.size: 15
thread_pool.warmer.keep_alive: 30s

jvm.options includes the following:

-XX:MaxDirectMemorySize=2G
-Xms2g
-Xmx2g

And the environment variable MALLOC_ARENA_MAX is set to 4.

At this point, ES keeps grabbing more and more off-heap memory until Mesos kills the Docker container. However, as I noted above, the percentage of RAM grabbed by the thread stack is way down from the first configuration I reported (35 MB of a 226278 KB total, or about 15%). Internal is 154 MB, or about 68%.

Any ideas as to which knob(s) to turn next?

Thanks

--John

Hi John,

debugging that sort of issue is unfortunately a bit involved. Hendrik from our Cloud team has written an article about that: Tracking Down Native Memory Leaks in Elasticsearch. See especially the part about jemalloc.
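
Roughly, the jemalloc approach from that article comes down to preloading jemalloc with its allocation profiler switched on before starting Elasticsearch; a rough sketch of the general idea, assuming a jemalloc build with profiling enabled (the path and option values here are only illustrative, see the article for the exact setup):

    export LD_PRELOAD=/path/to/libjemalloc.so.2
    export MALLOC_CONF=prof:true,lg_prof_interval:30,prof_prefix:/tmp/jeprof

The resulting heap profile dumps can then be inspected with jemalloc's jeprof tool to see which native call sites keep allocating.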

Daniel

Hi Daniel,

Yeah, saw that article, definitely a good one. Gonna re-engage on this today and see what I can find out.

Thanks again for the follow-up.

--John
