OOM since 8.16.1 with openjdk23

Hi,

After upgrading from 8.15.1 to 8.16.1, all machines in two of our four ES clusters are running out of memory after 7-12 hours.

Our current setup for all clusters is:
System: Ubuntu 22
3 master nodes
3 data nodes (64 GB RAM, 32 GB Xmx, 48 CPUs)
2 Kibana nodes

The internal ES monitoring also does not show any issues with heap. I cannot upload it directly here, as the company does not allow it. If you need it, I can upload it to an image host of your choice.

This is the message I can find in my syslog:

Dec  3 01:20:46 datanode1 systemd-entrypoint[296225]: # There is insufficient memory for the Java Runtime Environment to continue.
Dec  3 01:20:46 datanode1  systemd-entrypoint[296225]: # Native memory allocation (malloc) failed to allocate 1048576 bytes. Error detail: AllocateHeap
Dec  3 01:20:46 datanode1  systemd-entrypoint[296225]: # An error report file with more information is saved as:
Dec  3 01:20:46 datanode1  systemd-entrypoint[296225]: # /var/log/elasticsearch/hs_err_pid296225.log

I can also share the JVM fatal error log and the last GC logs before the crash.
The gc.log always ends with a full pause before the crash:

[2024-12-03T01:20:46.380+0000][296225][gc,start    ] GC(632) Pause Full (System.gc())
[2024-12-03T01:20:46.380+0000][296225][gc,task     ] GC(632) Using 33 workers of 33 for full compaction

I didn't change anything in jvm.options; I only set Xms/Xmx and LimitMEMLOCK=infinity.
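For reference, this is roughly what those two settings look like on our nodes (the systemd override path is an assumption about a standard package install; the values are ours):

jvm.options (the only lines we touched):
-Xms32g
-Xmx32g

/etc/systemd/system/elasticsearch.service.d/override.conf:
[Service]
LimitMEMLOCK=infinity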

I think it could be related to the switch to OpenJDK 23, as we are using the Java bundled with ES. Do you think it's worth installing OpenJDK 22 on the system and using that instead? Otherwise I have no clue how to get out of this.
Even adding 18GB of RAM (to 82GB) did not help - still OOM.
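In case it matters for that experiment: on the Debian/Ubuntu package you can point Elasticsearch at a separately installed JDK via ES_JAVA_HOME rather than the bundled one; the JDK path below is a placeholder.

# /etc/default/elasticsearch
ES_JAVA_HOME=/opt/jdk-22   # placeholder path to the manually installed JDK

followed by a systemctl restart elasticsearch.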

Thanks!

Can you try a 31 GB heap (Xms/Xmx) in jvm.options?
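(The usual reason for staying just under 32 GB is that the JVM can then use compressed ordinary object pointers. If you want to verify that a given heap size still qualifies, something like this should show it; the path assumes the bundled JDK from the deb package:)

/usr/share/elasticsearch/jdk/bin/java -Xmx31g -XX:+PrintFlagsFinal -version | grep UseCompressedOops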

You can also try the following two values in your sysctl.conf file:

vm.overcommit_memory=2
vm.overcommit_ratio=85
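(To apply them without a reboot you can set them directly and/or reload sysctl.conf, e.g.:)

sysctl -w vm.overcommit_memory=2
sysctl -w vm.overcommit_ratio=85
# or, after adding the two lines to /etc/sysctl.conf:
sysctl -p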

Thanks for the tips. I will try both of them on different machines and get back to you.

I also tried OpenJDK 22 on one of the machines, but it still crashed.

Even after trying all of the mentioned tips, we still have OOMs twice a day on all data nodes, even on the node with OpenJDK 22. So it does not look like an issue with the new OpenJDK.

We are ingesting data with Logstash. Maybe the pressure on ES is too high? Should we try to decrease the batch size?
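(If we try that, I assume it would just mean lowering these in logstash.yml; the values below are the Logstash defaults, shown only as a reference point, not a recommendation:)

pipeline.batch.size: 125   # events per batch sent to the outputs (default 125)
pipeline.batch.delay: 50   # ms to wait before flushing an undersized batch (default 50)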

I upgraded my test bed last week and have had no issues so far.

I'm also using the same OpenJDK provided by Elastic:

java -version
openjdk version "23" 2024-09-17
OpenJDK Runtime Environment (build 23+37-2369)
OpenJDK 64-Bit Server VM (build 23+37-2369, mixed mode, sharing)

How much data are you writing?
I have a lot of data in my test setup, but I let Logstash handle the limits and don't tune anything myself.

cat ../logstash.yml |grep -v ^#
node.name: "myhostname"
path.data: /s1/logstash
pipeline.batch.size: 256

http.host: "myhostname"
http.port: 9600
log.level: info
path.logs: /s1/log/logstash

My jvm.options looks like this:

cat ../jvm.options |grep -v ^#
-Xms15g
-Xmx15g
14:-XX:+UseG1GC

Logstash is running on its own VM.

If a node is consistently going to crash, you can monitor it with some other old-school tools (top/htop/vmstat/...) until it crashes, pipe the output to files, and look at them after the crash.
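Something along these lines, for example (the pgrep pattern and the intervals are just assumptions to adapt):

# system-wide memory stats every 30s until the crash
vmstat -t 30 > /var/tmp/vmstat.log &

# the ES process's resident size, virtual size and thread count every 30s
while true; do
  date
  ps -o rss,vsz,nlwp -p "$(pgrep -f org.elasticsearch | head -1)"
  sleep 30
done > /var/tmp/es_mem.log &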

I am curious whether memory pressure grows steadily until it crashes, or whether something goes wild/wrong and it snowballs very rapidly.

might be helpful

I'd just like to add that I've reported the same issue happening on both of our clusters since upgrading from 8.15 to 8.17.

More details are in the thread but quick summary:

  • Two completely isolated clusters of completely different sizes are showing OOMs since upgrading.
  • The search & indexing pattern is consistent with what it has always been, and we had seen zero OOMs in the year or so we've been running 8.x on these clusters; only since upgrading to 8.17 a couple of days ago have we seen 10+ OOMs across nodes.
  • The OOMs seem to happen across all our hot nodes within a period of a couple of hours (i.e. all hot nodes will OOM once within a given period).
  • We did not change the JVM version in the 8.16 -> 8.17 upgrade.

@ALIT What do you have set on your machine for the value of vm.max_map_count if you run sysctl -a?

We've always had this set to 262144, as per Elastic's recommendation.

I set this to an artificially lower value in our testing environment and waited for this value to be reached by the Elastic process; the process exited with the exact same error I've been seeing.

I've now doubled this value on our clusters to see if it prevents, or delays, the OOMs we've been seeing. I've already observed that the number of memory regions in use on some of our hot nodes is greater than the previous limit, so I'm more confident this is the source of the problem. Whether things will just grow to the next limit or not, I don't know.

I would be interested to know if you're able to observe the same in your cluster. You can do wc -l /proc/<PID>/maps to see the current number in use.
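(Run on a data node, something like this shows the limit and the current usage side by side; the pgrep pattern is an assumption about how the JVM process appears in the process list:)

sysctl vm.max_map_count
wc -l /proc/"$(pgrep -f org.elasticsearch | head -1)"/maps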

Out of curiosity, did either @Evesy or @ALIT resolve the memory issue?

@RainTown I've not seen any issues since increasing the limit mentioned above, so I would say it's resolved in that sense.

Which limit did you increase? vm.max_map_count? What's your new limit to prevent the crash?

sysctl -a | grep max_map_count
vm.max_map_count = 262144

wc -l /proc/3919887/maps
215 /proc/3919887/maps

We just doubled it to see where that would leave us, and we did subsequently observe the value in use get to the circa 400k mark after increasing it.

The amount in use in your output looks really low, but we did observe ours starting low and then growing over the next 24 hours of operation.

I just realized I checked the wrong process.
This is the correct one:

173802 /proc/1841515/maps

I increased it to 500k to see if it helps.
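(For anyone following along, raising it and keeping it across reboots is something like this; the drop-in file name is just a convention:)

sysctl -w vm.max_map_count=500000
echo 'vm.max_map_count=500000' > /etc/sysctl.d/99-elasticsearch.conf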

No crash on the reconfigured node. I think you nailed it, @Evesy.

current stats:

466096 /proc/1841515/maps

The change in behaviour in 8.16.1+, which both of you reported, seems worthy of a bug report to me.

@ALIT Everything still looking ok since you made the change?

I've opened Elasticsearch 8.16.x Large Increase in MMAP Counts · Issue #119652 · elastic/elasticsearch · GitHub as a bug.

I changed it on 31st Dec.
I have had 3 crashes since, so the issue is still there, but there are a lot fewer crashes.

I guess I could increase max_map_count even higher to solve it, but I have no idea what impact that would have on the system.

It would be useful if someone can provide the hs_err_pidXXXX.log generated when it crashes, either here, on the github issue or as a gist.

@ALIT Is this something you'd be able to capture & share given you're still seeing the errors?

For us the hs_err_pidXXXX.log is written to a location that is not persisted across pod restarts, so we would need to make changes to keep it, or move it, somewhere persistent.
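(One possible option: the JVM's crash report path can be redirected with -XX:ErrorFile, e.g. via a jvm.options entry pointing at a persistent volume; the path below is a placeholder, and %p expands to the PID:)

-XX:ErrorFile=/mnt/persistent/logs/hs_err_pid%p.log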