Yeah, for now I will not change the heap size. I will only set -Dorg.apache.lucene.store.MMapDirectory.sharedArenaMaxPermits=1.
Really, thank you so much. You helped me a lot :)
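For anyone following along, a minimal way to apply that setting (assuming JVM flags are managed via a custom options file; the filename here is just an example) is to create config/jvm.options.d/mmap.options on each data node with the single line

-Dorg.apache.lucene.store.MMapDirectory.sharedArenaMaxPermits=1

and then restart the node so the JVM picks it up.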
I monitored 1 data node. When the Elasticsearch process reaches about 86% RAM usage in top (RSS ≈ 13.5 GB), the Native Memory Tracking report shows:
Total: reserved = 10402009 KB, committed = 9058861 KB
The maximum number of mappings at that moment is around 84,833.
So from NMT I see roughly 10 GB reserved (about 9 GB committed), while the process RSS reaches up to 14.2 GB.
My questions are:
Is this difference between RSS (13.5 GB) and NMT “Total reserved” (~10 GB) expected, or does it indicate a possible native memory leak?
Does the “Total” in the Native Memory Tracking report include memory used for Lucene’s mmapped index files and the OS page cache, or are those outside what NMT accounts for?
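For context, figures like these can be collected roughly as follows (a sketch; $PID stands for the Elasticsearch process id, and the NMT report requires the JVM to have been started with -XX:NativeMemoryTracking=summary or detail):

jcmd $PID VM.native_memory summary   # the "Total: reserved / committed" figures
wc -l /proc/$PID/maps                # number of memory mappings
grep VmRSS /proc/$PID/status         # resident set size as seen by the kernel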
thanks for sticking with us …
There’s the heap and direct memory too, so I’d consider this “expected”. RSS is the total resident set (aka RAM used) of the JVM process, covering all the various “types” of memory.
Yes to Lucene’s mmapped files. Operating system caches are not memory allocated to the JVM.
This is broadly what you shared before. And it is not growing?
Compare both RSS and that value with any other data nodes.
But also understand that the process’ RSS increasing is not in itself an indication of a leak.
EDIT: Consider attaching jconsole to all your data nodes to help monitor them. An implicit assumption I’ve made here is that all your data nodes are roughly equal in terms of spec, shards, query/ingest load, etc. Is that the case? Can you share the output of a GET (use Dev Tools) on
_cat/nodes?v&h=name,ip,role,version,master,u,cpu,rc,rm,rp,hc,hm,hp,load_1m,load_5m,load_15m&bytes=b
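Outside Dev Tools, the same request can be made with curl along these lines (a sketch; the URL and the -u elastic user are assumptions, adjust for your TLS/auth setup):

curl -s -u elastic 'https://localhost:9200/_cat/nodes?v&h=name,ip,role,version,master,u,cpu,rc,rm,rp,hc,hm,hp,load_1m,load_5m,load_15m&bytes=b'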
name role version master u cpu rc rm rp hc hm hp load_1m load_5m load_15m
prod-elastic-data03 di 9.0.0 - 3.9d 12 16297820160 16728715264 97 6042545512 8589934592 70 1.21 0.84 0.84
prod-elastic-data08 di 9.0.0 - 5.4d 37 16398499840 16728768512 98 5905580032 8589934592 68 1.84 1.95 2.01
prod-elastic-data01 di 9.0.0 - 5.4d 21 15351324672 16875683840 91 4529254496 8589934592 52 2.89 2.47 2.37
prod-elastic-data07 di 9.0.0 - 5.5d 17 16552493056 16875651072 98 4536139776 8589934592 52 1.71 1.69 1.56
prod-elastic-data02 di 9.0.0 - 4.4d 22 16544157696 16875622400 98 6079643648 8589934592 70 2.12 2.31 2.06
prod-elastic-data05 di 9.0.0 - 3.7d 29 16554582016 16875663360 98 3088269760 8589934592 35 2.33 2.19 2.30
prod-elastic-data06 di 9.0.0 - 5.6d 33 16544481280 16875675648 98 3870582392 8589934592 45 0.96 1.00 1.20
prod-elastic-data04 di 9.0.0 - 3.8d 9 16528756736 16875642880 98 6221453072 8589934592 72 0.25 0.25 0.30
Thank you for the explanation! I can see similar results across the data nodes.
Best not to paste pictures of text please, just paste the text itself.
Any more crashes?
I see you changed the ip to io, that’s fine, we don’t really care about your IP addresses.
tbh your output looks pretty healthy to me, except none of your data nodes has been up for long, I guess as a result of restarting them with different JVM options. Across your data nodes the heap varies a bit, but I’ve no idea where in its GC cycle each node is (jconsole is good for watching this). The cluster is not significantly loaded. The fact that you are on 9.0.0 stands out a bit; that’d be fine if it had been released last week, but it looks like you upgraded at least 72 days ago. Was 9.0.0 the latest release when you did the upgrade? (Unlikely, as 9.0.1 followed fairly soon, and other point releases since then too.)
As of right now I’m thinking it’s unlikely that you have hit the specific bug we’ve discussed: you seem to get a different error, your counts seem well short of the limit, and you say wc -l /proc/$pid/maps is fairly consistent across all the data nodes?
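If it helps, the mapping count and the limit it is usually compared against can be checked on each data node like this (a sketch; that the relevant limit is the kernel's vm.max_map_count is my assumption here, and $PID is the Elasticsearch process id):

wc -l /proc/$PID/maps     # current number of mappings for the process
sysctl vm.max_map_count   # per-process mapping limit (Elasticsearch's bootstrap check requires at least 262144)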
Outside of doing an upgrade, I’m not sure what to suggest now to nail down the issue.
Others are welcome to chime in.
Hi @Aysel_Guliyeva Welcome to the community... I see you are getting lots of help.
I was a bit surprised at the response count so I thought I would drop in.
I see you are getting lots of detailed help...
AND so I will go back to the beginning
If this is an important Elasticsearch cluster (i.e. the data nodes), it is an anti-pattern to run other applications on the same VMs... the fact that you are trying to calculate the memory used by apps like Grafana etc., add it all up, and make it all fit... is a bad plan... period.
This is a classic pattern of memory competition... there are lots of reasons why application memory can spike, and as soon as Elasticsearch cannot get the memory it requires... poof!
You are running out of memory... you are most likely colliding with some other app that is claiming memory. Running top here and there is not going to give you enough insight.
On top of this, if the approach in the environment is high utilization (which is a completely valid approach) but the underlying virtualization is NOT pinning CPU and memory, Elastic can and will be unstable... i.e. is the underlying VM "thin provisioned"?
I used to give a talk at ElasticON on this very topic... I have seen it over and over again.
All these other settings are a valid discussion, but my perspective is that they are most likely not the root cause nor the fix for the basic underlying issue.
Elastic node with Dedicated Host or VMs (with dedicated resources) = Best / Stable Outcome
Kevin, thank you for your answer. Yes, I changed the JVM options for the data nodes, so I restarted them. The Elasticsearch version was upgraded to 9.0.0 about 5 months ago, I think. None of the data nodes have crashed during our discussion.
Hello. Thank you!
Only node exporter runs on the Elasticsearch data nodes. Other apps such as Grafana and Prometheus are located on separate VMs.
RAM, CPU and disk are statically reserved for Elasticsearch data nodes.
Thank you for the explanation.
Thanks for sharing the various updates. To summarize where we are now (correct if I got anything wrong):
wc -l /proc/$pid/maps

A couple of questions:
any update @Aysel_Guliyeva ?
Hello, Kevin. Thank you for your answer.
All of these points are true.
The answers to your questions:
Yes. But in Elasticsearch 8.13.4 we didn’t have a lot of latency rules for services in APM. Now a latency rule for each service, a rate limit rule, and another custom ESQL rule for custom services are running in APM; before, they didn’t exist. So I can’t compare the current situation on Elasticsearch 9.0.0 with Elasticsearch 8.13.4.
So far, the Elasticsearch service has been stable and has not crashed on any data nodes.
So, either the setting above fixed it, or we are still waiting for the next crash.
To guess whether it's the former, I'd say somewhere around 2 times the longest recent gap between data-node crashes is a decent estimate. So if it ran for 10 days without crashing sometime in Nov, say, and that's the best recent no-crash gap, then you'd need to wait until ca. 20 days from now. Even then you cannot be sure, but the evidence would be mounting.
If it's the latter, it's important to capture as much info as possible about the crash when it happens. Run the jcmd command every minute or so, maybe with the VM.native_memory detail and Thread.print options, watch jconsole, ...
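For example, something along these lines could run while waiting for the next crash (a sketch only; $PID and the output paths are placeholders, and VM.native_memory needs -XX:NativeMemoryTracking=detail enabled at JVM startup):

while true; do
  ts=$(date +%Y%m%d-%H%M%S)
  jcmd $PID VM.native_memory detail > /var/tmp/nmt-$ts.txt     # native memory breakdown
  jcmd $PID Thread.print > /var/tmp/threads-$ts.txt            # thread dump
  wc -l /proc/$PID/maps >> /var/tmp/maps-count.log             # mapping count over time
  sleep 60
done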
Also, there is no swap partition on the VMs, right? Nor on the hosts hosting the VMs?
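A quick way to confirm that on each VM (and, where you have access, on the hypervisor hosts):

swapon --show             # no output means no swap devices are active
free -h                   # the Swap line should show 0B
grep -i swap /etc/fstab   # make sure nothing re-enables swap at boot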
( and I would welcome more input from @stephenb too )
BTW 9.0.0 ... you will probably want to get to 9.2.2+ at some point; there have already been many improvements.
Funny you mention this... I spent a looong time with another user where the Elastic process was dying / being killed unexpectedly... it turned out to be a new corporate security scan (Qualys, I think) that was not recognizing the process and killing it... that was painful.
We were there already; it was answered with:
Yep, though likely Qualys. Stuff like this needs checking, because the "I" in "RACI" is very rarely respected.
@Aysel_Guliyeva - are we still waiting for that next crash? what does a GET on
return now?