After upgrading from 8.15.1 to 8.16.1, all machines in two of our four ES clusters are running out of memory after 7-12 hours.
Our current setup for all clusters is:
System: Ubuntu 22
3 master nodes
3 data nodes (64 GB RAM, 32 GB Xmx, 48 CPUs)
2 Kibana nodes
The internal ES monitoring also does not show any issues with heap. I cannot upload it here directly, as the company does not allow it. If you need it, I can upload it to an image host of your choice.
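For comparison, heap usage can also be pulled straight from the nodes stats API; a minimal example, assuming unauthenticated access on localhost:9200 (adjust for your security setup):

curl -s 'http://localhost:9200/_nodes/stats/jvm?filter_path=nodes.*.name,nodes.*.jvm.mem.heap_used_percent,nodes.*.jvm.mem.heap_max_in_bytes&pretty'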
This is the message I can find in my syslog:
Dec 3 01:20:46 datanode1 systemd-entrypoint[296225]: # There is insufficient memory for the Java Runtime Environment to continue.
Dec 3 01:20:46 datanode1 systemd-entrypoint[296225]: # Native memory allocation (malloc) failed to allocate 1048576 bytes. Error detail: AllocateHeap
Dec 3 01:20:46 datanode1 systemd-entrypoint[296225]: # An error report file with more information is saved as:
Dec 3 01:20:46 datanode1 systemd-entrypoint[296225]: # /var/log/elasticsearch/hs_err_pid296225.log
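For anyone wanting to check their own nodes for the same message, something like this should find it in the journal (assuming the stock elasticsearch.service systemd unit):

journalctl -u elasticsearch.service --since today | grep -i 'insufficient memory'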
I can also share the JVM fatal error log and the last gc logs before the crash.
The gc.log always ends with a full GC pause right before the crash:
[2024-12-03T01:20:46.380+0000][296225][gc,start ] GC(632) Pause Full (System.gc())
[2024-12-03T01:20:46.380+0000][296225][gc,task ] GC(632) Using 33 workers of 33 for full compaction
I didn't change anything in jvm.options apart from setting Xmx/Xms, and the only other change is LimitMEMLOCK=infinity.
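To be concrete, the overrides are roughly the following; paths assume the deb/rpm layout and the file name under jvm.options.d is arbitrary:

/etc/elasticsearch/jvm.options.d/heap.options:
-Xms32g
-Xmx32g

systemd drop-in (created with systemctl edit elasticsearch):
[Service]
LimitMEMLOCK=infinity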
I think it could be related to the switch to OpenJDK 23, as we are using the JDK bundled with ES. Do you think it's worth installing OpenJDK 22 on the system and using that instead? Otherwise I have no clue how to get out of this.
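If we do try a system JDK, my understanding is that pointing ES_JAVA_HOME at it is the supported way to override the bundled one; a sketch for the deb package, with the path depending on how and where OpenJDK 22 is installed:

# /etc/default/elasticsearch
ES_JAVA_HOME=/usr/lib/jvm/java-22-openjdk-amd64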
Even adding 18 GB of RAM (to 82 GB) did not help - still OOM.
Even after trying all of the mentioned tips, we still get OOMs twice a day on all data nodes, including the node running OpenJDK 22. So it does not look like an issue with the new OpenJDK.
We are ingesting data with Logstash. Maybe the pressure on ES is too high? Should we try to decrease the batch size?
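If we do throttle, my understanding is the first knobs to try are in logstash.yml; the values below are just the documented defaults, shown to indicate where we would cut:

pipeline.batch.size: 125   # default; lower this to shrink each bulk request sent to ES
pipeline.batch.delay: 50   # default, in ms
# pipeline.workers defaults to the number of CPU cores on the Logstash host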
If a node is consistently going to crash, you can monitor it with some old-school tools (top/htop/vmstat/...) until it crashes, pipe the output to files, and look at them after the crash.
I am curious whether memory pressure grows steadily until it crashes, or whether something suddenly goes wrong and snowballs very rapidly.
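A rough sketch of what I mean, assuming a single Elasticsearch java process on the box and the procps tools installed:

PID=$(pgrep -f org.elasticsearch | head -n 1)   # PID of the ES java process
vmstat -t 10 > vmstat.log &                     # system-wide memory stats with timestamps
VMSTAT=$!
while kill -0 "$PID" 2>/dev/null; do            # loop until the ES process dies
    echo "$(date '+%F %T') $(ps -o rss=,vsz= -p "$PID")" >> es-mem.log
    sleep 10
done
kill "$VMSTAT"                                  # stop vmstat once the node has crashed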
I'd just like to add that I've reported the same issue happening on both of our clusters since upgrading from 8.15 to 8.17.
More details are in the thread, but as a quick summary:
Two completely isolated clusters of very different sizes have been showing OOMs since upgrading.
The search & indexing pattern is the same as it has always been. We saw zero OOMs in the year or so we've been running 8.x on these clusters; only since upgrading to 8.17 a couple of days ago have we seen 10+ OOMs across nodes.
The OOMs seem to happen across all our hot nodes within a couple of hours of each other (i.e. all hot nodes will OOM once within a given period).
We did not change the JVM version in the 8.16 -> 8.17 upgrade.
I set this limit to an artificially low value in our testing environment and waited for the Elasticsearch process to reach it; the process exited with the exact same error I've been seeing.
I've now doubled this value on our clusters to see if it prevents, or at least delays, the OOMs we've been seeing. I've already observed that the number of memory regions in use on some of our hot nodes is greater than the previous limit, so I'm more confident this is the source of the problem. Whether things will just keep growing to the new limit, I don't know.
I would be interested to hear whether you can observe the same in your cluster. You can run wc -l /proc/<PID>/maps to see the current number of mappings in use.
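For example, something like this on a data node; it assumes the limit in question is the kernel's vm.max_map_count and a single Elasticsearch java process:

sysctl vm.max_map_count                                        # per-process limit on memory map areas
wc -l "/proc/$(pgrep -f org.elasticsearch | head -n 1)/maps"   # mappings currently in use by ES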
It would be useful if someone could provide the hs_err_pidXXXX.log generated when it crashes, either here, on the GitHub issue, or as a gist.
@ALIT Is this something you'd be able to capture & share given you're still seeing the errors?
For us the hs_err_pidXXXX.log is written to a location that is not persisted across pod restarts, so we would need to make changes to keep it, or move it, somewhere persistent.
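One option might be pointing the JVM error file at a path on a persistent volume via a jvm.options.d entry (or however JVM options are injected in your deployment); the path below is just a placeholder:

-XX:ErrorFile=/persistent/mount/hs_err_pid%p.log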