Elasticsearch uses more memory than JVM heap settings, reaches container memory limit and crashes

Elasticsearch uses more memory than the JVM heap settings, which are currently -Xms512m and -Xmx512m. I tried setting both values to 1g, but reverted because the container crashed with an OOM error immediately after it was relaunched.

I run Elasticsearch 7.3.0 on ECK, and memory usage is reported by Prometheus node exporter.
The memory limit is set to 1.5GiB to leave some memory for the EC2 instances, which are 3x t3.small instances with 2GB of RAM each.

Is it a bad idea to set a memory limit for Elasticsearch containers? I'm not sure whether the limit includes virtual memory or not. If it does, that might cause the container to crash even when enough memory is available.

Quoting the docs on setting the heap size (emphasis mine):

Set Xmx and Xms to no more than 50% of your physical RAM. Elasticsearch requires memory for purposes other than the JVM heap and it is important to leave space for this. For instance, Elasticsearch uses off-heap buffers for efficient network communication, relies on the operating system’s filesystem cache for efficient access to files, and the JVM itself requires some memory too. It is normal to observe the Elasticsearch process using more memory than the limit configured with the Xmx setting.

In a container, "physical RAM" means the memory limit of the container. If you have a 1.5GiB memory limit on your container then you must set the Elasticsearch heap size to no more than 0.75GiB.
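As a minimal sketch of that arithmetic (the function names here are hypothetical, and the 50% figure is simply the upper bound quoted from the docs above, not a tuned value), the largest heap for a given container limit could be computed like this:

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def max_heap_bytes(container_limit_bytes: int, heap_fraction: float = 0.5) -> int:
    """Upper bound for -Xms/-Xmx given the container memory limit.

    heap_fraction defaults to the 50% ceiling from the heap-size docs.
    """
    return int(container_limit_bytes * heap_fraction)


def jvm_heap_flags(heap_bytes: int) -> str:
    """Format matching -Xms/-Xmx flags, rounded down to whole MiB."""
    mib = heap_bytes // MIB
    return f"-Xms{mib}m -Xmx{mib}m"


# The 1.5GiB container limit discussed in this thread.
limit = int(1.5 * GIB)
print(jvm_heap_flags(max_heap_bytes(limit)))  # -Xms768m -Xmx768m
```

By this rule a 512m heap fits under the 1.5GiB limit, while a 1g heap does not.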

Then what is my problem?
The JVM heap is currently set to 512MB, which is less than 50% of the container memory limit.
In my case memory usage sometimes exceeds 1.5GB and the container crashes.
The container memory limit is 3 times as large as the heap size.

Someone on Stack Overflow reports a similar error, and says that unsetting the memory limit fixed the problem.

He also says that Linux kernel 4.15 fixes the memory issue.

Is the issue still not fixed?


Sorry, this wasn't clear. You said:

A 1GiB heap is definitely too large for a 1.5GiB container.

Yes, there are known bugs in some kernels that inappropriately trigger the OOM killer in a container. That still doesn't mean it's a bad idea to set the memory limit on an Elasticsearch container; it just means it's a bad idea to use a buggy kernel.
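To see what is actually being charged against the container limit, one option is to look at the cgroup's memory.stat file. The following is only a sketch, assuming cgroup v1 (where memory.stat has `rss` and `cache` lines in bytes); the function names and the sample numbers are made up for illustration, and `rss + cache` is only an approximation of what the kernel charges. Inside a container the real file is usually at /sys/fs/cgroup/memory/memory.stat, with the limit in memory.limit_in_bytes next to it.

```python
def parse_memory_stat(text: str) -> dict:
    """Parse the 'key value' lines of a cgroup v1 memory.stat file into ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def summarize(stats: dict, limit_bytes: int) -> dict:
    """Approximate the charged memory as rss + page cache (an assumption)."""
    rss = stats.get("rss", 0)
    cache = stats.get("cache", 0)
    charged = rss + cache
    return {
        "rss": rss,
        "cache": cache,
        "charged": charged,
        "headroom": limit_bytes - charged,
    }


# Hypothetical sample contents, not taken from a real node.
SAMPLE_STAT = """\
cache 419430400
rss 943718400
mapped_file 104857600
"""

limit = 1536 * 1024 ** 2  # the 1.5GiB container limit
print(summarize(parse_memory_stat(SAMPLE_STAT), limit))
```

In this made-up sample the page cache accounts for a large share of the charged memory, which is the kind of breakdown that helps tell heap growth apart from filesystem cache.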

If you think it's not that, please share the full dmesg output from such a crash; it could be thousands of lines long, so use https://gist.github.com/ if it doesn't fit here.

Maybe it is related to JVM "metaspace" usage, not heap. Check the Java MetaspaceSize and MaxMetaspaceSize settings (Xmx and Xms too, of course).

Try several settings for heap/metaspace, and monitor JVM heap/metaspace usage with the "jstat" command before setting container memory limits.

https://docs.oracle.com/javase/8/docs/technotes/tools/windows/jstat.html
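For example, the output of `jstat -gc <pid>` could be summarized with a small script. This is a rough sketch: the helper names are hypothetical, the sample output below is invented rather than captured from a real node, and it assumes the usual `jstat -gc` columns (S0U, S1U, EU, OU for used heap regions and MU for used metaspace, all in kB).

```python
def parse_jstat_gc(output: str) -> dict:
    """Map jstat -gc column names to the values on the last data row."""
    lines = [line for line in output.splitlines() if line.strip()]
    header, values = lines[0].split(), lines[-1].split()
    row = {}
    for name, raw in zip(header, values):
        try:
            row[name] = float(raw)
        except ValueError:
            row[name] = None  # some JDKs print "-" for unused columns
    return row


def usage_mib(row: dict) -> dict:
    """Sum used heap regions and report heap and metaspace usage in MiB."""
    heap_kb = sum(row.get(col) or 0.0 for col in ("S0U", "S1U", "EU", "OU"))
    meta_kb = row.get("MU") or 0.0
    return {"heap_used_mib": heap_kb / 1024, "metaspace_used_mib": meta_kb / 1024}


# Invented sample output for a JVM started with a 512m heap.
SAMPLE = """\
 S0C    S1C    S0U    S1U      EC       EU        OC         OU       MC     MU    CCSC   CCSU   YGC     YGCT    FGC    FGCT     GCT
 0.0   1024.0  0.0   1024.0 153600.0 102400.0  369664.0   204800.0  86016.0 81920.0 11264.0 10240.0    42    0.512   0      0.000    0.512
"""

print(usage_mib(parse_jstat_gc(SAMPLE)))
```

Running this periodically against real `jstat -gc` output would show whether heap or metaspace usage is actually growing over time.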

@jcastelc if you are seeing evidence of ongoing metaspace allocation in your cluster then I'd like to see more detail. It is rare to see metaspace memory pressure with Elasticsearch. I think its metaspace usage should be pretty much constant since I'm not aware of any dynamic loading happening after startup, and we account for this in the 2x limit described in the documentation.

Good to know. I have no evidence; it was only a suggestion. Thanks for the details!

My ES cluster crashed again, but this time memory usage of the ES container stayed within the 1.5GiB limit.
I've found that the EBS burst credit for the root volume was running out before the crash. This might be a problem other than memory usage...

This time I was able to get logs from the failed node. It logged lots of warnings from JvmGcMonitorService, but these might be caused by the loss of EBS burst credit for the root volume.

The EBS burst credit decreased slowly before the crash. Once it reaches 0, the pod gets evicted because of the slow I/O that results.
I should solve this problem first...
Thank you for the answers anyway.