I continue to observe an Elasticsearch-on-Docker instance that grabs more and more off-heap memory until Mesos kills the Docker container.
Since no data is going in except for the .monitor index, and searches on this ES instance are pretty quiet, this is really peculiar. Note: the heap memory stays constant at just under 1GB.
Wild guess: the JVM's ergonomic choice for the maximum allowed direct memory is too high.
Unfortunately, it is a bit hard to find out which value the JVM has chosen (you can write a small Java program which reads sun.misc.VM.maxDirectMemory() and invoke it in the Docker container).
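A minimal sketch of such a program, assuming a JDK 8-style runtime where sun.misc.VM is still accessible without extra module flags:

    // MaxDirectMemory.java -- minimal sketch; assumes a JDK 8-style runtime
    // where sun.misc.VM is accessible.
    public class MaxDirectMemory {
        public static void main(String[] args) {
            // Prints the limit the JVM chose for direct (off-heap) buffers, in bytes.
            System.out.println(sun.misc.VM.maxDirectMemory());
        }
    }

Compile and run it inside the container (javac MaxDirectMemory.java && java MaxDirectMemory) to see the limit the JVM actually picked.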
You can restrict direct memory explicitly e.g. to 2G with -XX:MaxDirectMemory=2G. Be aware that you will see more frequent garbage collections if the value you choose is too small. Also, ensure that -XX:+DisableExplicitGC is not set in your jvm.options (it was set by default before Elasticsearch 5.5.2).
Thanks for your response! Um, I believe you mean -XX:MaxDirectMemorySize=2G. I tried that and it had no effect; off-heap memory usage kept growing until Mesos killed the Docker container.
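For reference, I added it to jvm.options roughly like this (2g being just the example value from above, not a recommendation):

    # jvm.options (sketch) -- 2g is only the example value discussed above
    -XX:MaxDirectMemorySize=2g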
Yes, I meant that parameter. Did you check what direct memory size the JVM has chosen for your container? I don't know anything about your environment and 2G might be the wrong choice as well; it was just an example of how to set it.
If it is really not direct memory, I'd start digging into whether native memory tracking reveals something.
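A rough sketch of how to turn it on (the jvm.options flag and the jcmd subcommands are standard HotSpot tooling; using PID 1 assumes Elasticsearch is the container's main process):

    # jvm.options -- enable native memory tracking (adds a little overhead)
    -XX:NativeMemoryTracking=detail

    # then, inside the container, assuming the ES JVM is PID 1:
    jcmd 1 VM.native_memory baseline
    jcmd 1 VM.native_memory detail.diff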
Cool, and I hope my reply did not seem snarky. There are likely JVM params I am unaware of, so just wanted to ensure I am trying out what you are suggesting.
Definitely great suggestion re: native memory tracking. I turned that on and restarted the container, monitoring now.
Also, I did confirm with ps -ef that (1) the only processes running in my container are ES and the ES plugin controller, and (2) the ES process is the one gobbling up more and more off-heap memory.
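For anyone following along, the check was essentially just this, run inside the container:

    # list everything running in the container
    ps -ef
    # and resident memory per process
    ps -e -o pid,rss,comm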
No, not at all. I just wanted to double-check whether you might have set it to a value that is still too high.
That's not necessarily a disadvantage. Setting some exotic JVM parameter might bite you even years down the road, see our blog for an example. I think it pays off to be conservative and adjust the defaults only when really necessary.
I hope that reveals something. I know that finding the root cause can be time-consuming. But often you learn interesting stuff about the JVM or the kernel.
Very interesting blog post, thanks for sharing that! BTW, regarding jvm.options: please share your recommended settings (at least the ones you start out with). It's understood that the -Xms and -Xmx settings will differ per deployment.
Interesting results. I ran jcmd 1 VM.native_memory baseline to establish a baseline, and I have a script executing jcmd 1 VM.native_memory detail.diff and writing the output to a file every 2 minutes.
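The script is nothing fancy, roughly this (the PID of 1 and the log path are assumptions about my container, adapt as needed):

    # sketch of the monitoring loop -- assumes the ES JVM is PID 1 in the container
    while true; do
      date >> /tmp/nmt-diff.log
      jcmd 1 VM.native_memory detail.diff >> /tmp/nmt-diff.log
      sleep 120   # every 2 minutes
    done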
The growth in native memory of the ES Java process matches the increase in overall ES Docker container memory usage reported by docker stats.
After monitoring for 1 hour-ish, it appears that the Thread stack memory is increasing way faster than Internal.
A couple of things:
-Xss--I have this set to 1 MB (-Xss1m).
Activity--this is a one-node cluster that has very few reads and even fewer writes.
I wonder if I have long-lived threads that are not being cleaned up due to a dearth of activity, and whether setting the thread stack size to a relatively high value accounts for more and more off-heap memory being used. I further wonder whether removing -Xss1m would make a difference. I am gonna try that and see what happens.
Definitely spit-balling here, but hey, what's the worst that can happen?
I would not recommend reducing the thread stack size because you risk running into StackOverflowError. Instead, I'd recommend that you reduce the number of threads. To be clear: this can reduce performance, but my impression is that you are more worried about memory usage than performance.
Also, you should track memory usage over a longer period of time than just one hour. The application needs time to warm up properly.
Cool, thanks, that makes sense re: thread stack size. My test of that did not go well, as off-heap memory usage actually appeared to accelerate.
A co-worker of mine had his thread_pool.generic.keep_alive set to 30s. I set that, kept the default thread stack size by removing -Xss1m, and let it run overnight. Interesting: whereas total native memory usage went up 521MB, the Thread stack portion went up only 197MB. Setting keep_alive to 30s appears to have helped, and it points to thread stacks gobbling up off-heap memory.
I restarted with thread_pool.generic.max set to 100 and am monitoring.
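For reference, the two settings mentioned above look roughly like this in elasticsearch.yml (the values are just what I tried, not recommendations):

    # elasticsearch.yml (sketch) -- values are the ones tried above
    thread_pool.generic.keep_alive: 30s
    thread_pool.generic.max: 100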
At this point, ES keeps grabbing more and more off-heap memory until Mesos kills the Docker container. However, as I noted above, the percentage of RAM grabbed by the thread stack is way down from the first configuration I reported (35MB of a roughly 226MB total, or about 15%). Internal is 154MB, or about 68%.
Debugging that sort of issue is unfortunately a bit involved. Hendrik from our Cloud team has written an article about it: Tracking Down Native Memory Leaks in Elasticsearch. See especially the part about jemalloc.
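In rough terms, the jemalloc approach from that article means preloading jemalloc and enabling its allocation profiling before starting Elasticsearch, something like the following (the library path and MALLOC_CONF values here are assumptions on my part; the article walks through the exact steps):

    # sketch -- library path and profiling options are assumptions, see the article
    export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2
    export MALLOC_CONF="prof:true,lg_prof_interval:30,lg_prof_sample:17"
    # start Elasticsearch as usual, then analyse the jeprof heap profiles it writes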