Elastic cluster goes down after 2-3 hours

The sequence of logs shared in the OP is not consistent with the shutdown being caused by the OOM killer. The OOM killer sends a SIGKILL, which causes immediate process exit, not a graceful shutdown. You don’t get any log messages in that case; the process just dies. If you were running it from the command line then you sometimes get the one-word message Killed on the last line, and an exit code of 137.
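
The exit code of 137 mentioned above is 128 plus the signal number of SIGKILL (9). A minimal sketch of how to observe this in a shell, with a sleep process standing in for the Elasticsearch JVM (the stand-in and the variable names are purely illustrative):

```shell
# Start a stand-in process in the background (not Elasticsearch itself).
sleep 60 &
victim=$!

# SIGKILL cannot be caught, so the process gets no chance to log anything.
kill -KILL "$victim"

# wait reports the child's exit status: 128 + 9 (SIGKILL) = 137.
status=0
wait "$victim" || status=$?
echo "exit status: $status"   # prints "exit status: 137"
```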

The logs in the OP are also not consistent with a Java OOM exception. That tends to be rather visible in the logs and again doesn’t go through the full stopping and closing sequence seen here.

It is unclear whether all of the shutdowns in question involve those log messages or whether @sathish12 just happened to pick an outlier. But all the ones that say stopping and closing are graceful shutdowns.

Note that if you run Elasticsearch from the console using & and then close the console, I expect the process will shut down like this. That’d be my first suspicion.

/var/log/messages-20251123:Nov 21 11:29:56 x.x.x.x kernel: GC Thread#0 invoked oom-killer: gfp_mask=0x6280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0
/var/log/messages-20251123:Nov 21 11:29:57 x.x.x.x kernel: oom_kill_process.cold.32+0xb/0x10
/var/log/messages-20251123:Nov 21 11:29:59 x.x.x.x kernel: [   2127]    42  2127    57597        0   208896      251             0 gsd-rfkill
/var/log/messages-20251123:Nov 21 11:29:59 x.x.x.x kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1002.slice/session-40.scope,task=java,pid=119441,uid=1002
/var/log/messages-20251123:Nov 21 11:29:59 x.x.x.x kernel: Out of memory: Killed process 119441 (java) total-vm:20097204kB, anon-rss:9454160kB, file-rss:0kB, shmem-rss:0kB, UID:1002 pgtables:33412kB oom_score_adj:0
497567  497290 14:23       40:57 26412 /home/demouser/Europa/jdk/bin/java -Xms4m -Xmx64m -XX:+UseSerialGC -Dcli.name=server -Dcli.script=Europa/bin/elasticsearch -Dcli.libs=lib/tools/server-cli -Des.path.home=/home/demouser/Europa -Des.path.conf=/home/demouser/Europa/config -Des.distribution.type=tar -cp /home/demouser/Europa/lib/*:/home/demouser/Europa/lib/cli-launcher/* org.elasticsearch.launcher.CliToolLauncher
 
 497632  497567 14:23       40:54 9410404 /home/demouser/Europa/jdk/bin/java -Des.networkaddress.cache.ttl=60 -Des.networkaddress.cache.negative.ttl=10 -Djava.security.manager=allow -XX:+AlwaysPreTouch -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -XX:-OmitStackTraceInFastThrow -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Dlog4j2.formatMsgNoLookups=true -Djava.locale.providers=SPI,COMPAT --add-opens=java.base/java.io=org.elasticsearch.preallocate --enable-native-access=org.elasticsearch.nativeaccess -XX:ReplayDataFile=logs/replay_pid%p.log -Des.distribution.type=tar -Xms7g -Xmx7g -Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,level,pid,tags:filecount=32,filesize=64m -Xms7g -Xmx7g -Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,level,pid,tags:filecount=32,filesize=64m -XX:MaxDirectMemorySize=3758096384 -XX:G1HeapRegionSize=4m -XX:InitiatingHeapOccupancyPercent=30 -XX:G1ReservePercent=15 --module-path /home/demouser/Europa/lib --add-modules=jdk.net --add-modules=ALL-MODULE-PATH -m org.elasticsearch.server/org.elasticsearch.bootstrap.Elasticsearch
 
 497655  497632 14:23       40:52   132 /home/demouser/Europa/modules/x-pack-ml/platform/linux-x86_64/bin/controller
 
 497663  497290 14:23       40:52 77124 /home/demouser/Kale/jdk/bin/java -Xms4m -Xmx64m -XX:+UseSerialGC -Dcli.name=server -Dcli.script=Kale/bin/elasticsearch -Dcli.libs=lib/tools/server-cli -Des.path.home=/home/demouser/Kale -Des.path.conf=/home/demouser/Kale/config -Des.distribution.type=tar -cp /home/demouser/Kale/lib/*:/home/demouser/Kale/lib/cli-launcher/* org.elasticsearch.launcher.CliToolLauncher
 
 497727  497663 14:23       40:48 10267476 /home/demouser/Kale/jdk/bin/java -Des.networkaddress.cache.ttl=60 -Des.networkaddress.cache.negative.ttl=10 -Djava.security.manager=allow -XX:+AlwaysPreTouch -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -XX:-OmitStackTraceInFastThrow -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Dlog4j2.formatMsgNoLookups=true -Djava.locale.providers=SPI,COMPAT --add-opens=java.base/java.io=org.elasticsearch.preallocate --enable-native-access=org.elasticsearch.nativeaccess -XX:ReplayDataFile=logs/replay_pid%p.log -Des.distribution.type=tar -Xms7g -Xmx7g -Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,level,pid,tags:filecount=32,filesize=64m -Xms7g -Xmx7g -Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,level,pid,tags:filecount=32,filesize=64m -XX:MaxDirectMemorySize=3758096384 -XX:G1HeapRegionSize=4m -XX:InitiatingHeapOccupancyPercent=30 -XX:G1ReservePercent=15 --module-path /home/demouser/Kale/lib --add-modules=jdk.net --add-modules=ALL-MODULE-PATH -m org.elasticsearch.server/org.elasticsearch.bootstrap.Elasticsearch
 
 497751  497727 14:23       40:46   852 /home/demouser/Kale/modules/x-pack-ml/platform/linux-x86_64/bin/controller
# ======================== Elasticsearch Configuration =========================
#
# NOTE: Elasticsearch comes with reasonable defaults for most settings.
#       Before you set out to tweak and tune the configuration, make sure you
#       understand what are you trying to accomplish and the consequences.
#
# The primary way of configuring a node is via this file. This template lists
# the most important settings you may want to configure for a production cluster.
#
# Please consult the documentation for further information on configuration options:
# https://www.elastic.co/guide/en/elasticsearch/reference/index.html
#
# ---------------------------------- Cluster -----------------------------------
#
# Use a descriptive name for your cluster:
#
cluster.name: demoCLuster
#
# ------------------------------------ Node ------------------------------------
#
# Use a descriptive name for the node:
#
node.name: Europa
#
# Add custom attributes to the node:
#
#node.attr.rack: r1
#
# ----------------------------------- Paths ------------------------------------
#
# Path to directory where to store the data (separate multiple locations by comma):
#
#path.data: /path/to/data
#
# Path to log files:
#
#path.logs: /path/to/logs
#
# ----------------------------------- Memory -----------------------------------
#
# Lock the memory on startup:
#
#bootstrap.memory_lock: true
#
# Make sure that the heap size is set to about half the memory available
# on the system and that the owner of the process is allowed to use this
# limit.
#
# Elasticsearch performs poorly when the system is swapping the memory.
#
# ---------------------------------- Network -----------------------------------
#
# By default Elasticsearch is only accessible on localhost. Set a different
# address here to expose this node on the network:
#
network.host: x.x.x.x
#
# By default Elasticsearch listens for HTTP traffic on the first free port it
# finds starting at 9200. Set a specific HTTP port here:
#
http.port: 9200
transport.port: 9300
#
# For more information, consult the network module documentation.
#
# --------------------------------- Discovery ----------------------------------
#
# Pass an initial list of hosts to perform discovery when this node is started:
# The default list of hosts is ["127.0.0.1", "[::1]"]
#
#discovery.seed_hosts: ["host1", "host2"]
#
# Bootstrap the cluster using an initial set of master-eligible nodes:
#
#cluster.initial_master_nodes: ["node-1", "node-2"]
#
# For more information, consult the discovery and cluster formation module documentation.
#
# ---------------------------------- Various -----------------------------------
#
# Allow wildcard deletion of indices:
#
#action.destructive_requires_name: false

#----------------------- BEGIN SECURITY AUTO CONFIGURATION -----------------------
#
# The following settings, TLS certificates, and keys have been automatically      
# generated to configure Elasticsearch security features on 24-06-2024 14:28:16
#
# --------------------------------------------------------------------------------

# Enable security features
xpack.security.enabled: true

xpack.security.enrollment.enabled: true

# Enable encryption for HTTP API client connections, such as Kibana, Logstash, and Agents
xpack.security.http.ssl:
  enabled: true
  keystore.path: certs/http.p12

# Enable encryption and mutual authentication between cluster nodes
xpack.security.transport.ssl:
  enabled: true
  verification_mode: certificate
  keystore.path: certs/transport.p12
  truststore.path: certs/transport.p12
# Discover existing nodes in the cluster
#discovery.seed_hosts: ["a.b.c.d:9300"]
#cluster.initial_master_nodes: ["a.b.c.d:9300","x.x.x.y:9300","x.x.x.z:9300"]
#cluster.initial_master_nodes: ["Europa","Elara","Kore"]
discovery.seed_hosts: ["x.x.x.x:9300", "x.x.x.x:9301", "x.x.x.y:9300", "x.x.x.y:9301","x.x.x.z:9300","x.x.x.z:9301"]
node.roles: [ master, data ]

# Allow HTTP API connections from anywhere
# Connections are encrypted and require user authentication
http.host: 0.0.0.0

# Allow other nodes to join the cluster from anywhere
# Connections are encrypted and mutually authenticated
transport.host: 0.0.0.0

#----------------------- END SECURITY AUTO CONFIGURATION -------------------------

I am using mRemoteNG to connect to the VMs, so I cannot monitor it until the end. As soon as I disconnect from the network, the mRemoteNG session drops.

Hi @DavidTurner

Do you need any other information? Let me know.

I think @DavidTurner has pinpointed it already.

I started Elasticsearch in the background, over an SSH connection.

% ps -ukevin | fgrep java
1651997 pts/1    00:00:02 java
1652057 pts/1    00:00:30 java

I then exit the shell. First of all I see

% exit
zsh: you have running jobs.

Then I exit anyways and see:

zsh: warning: 1 jobs SIGHUPed
[2025-12-12T11:50:12,734][INFO ][o.e.x.m.p.NativeController] [u2024] Native controller process has stopped - no new native processes can be started
[2025-12-12T11:50:12,737][INFO ][o.e.n.Node               ] [u2024] stopping ...
[2025-12-12T11:50:12,738][INFO ][o.e.c.f.AbstractFileWatchingService] [u2024] shutting down watcher thread
[2025-12-12T11:50:12,739][INFO ][o.e.c.f.AbstractFileWatchingService] [u2024] watcher service stopped
[2025-12-12T11:50:12,741][INFO ][o.e.x.w.WatcherService   ] [u2024] stopping watch service, reason [shutdown initiated]
[2025-12-12T11:50:12,742][INFO ][o.e.x.w.WatcherLifeCycleService] [u2024] watcher has stopped and shutdown

Notice the "warning: 1 jobs SIGHUPed" message: the shell sent a SIGHUP signal to the node, which caused it to shut itself down.
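
To illustrate the difference from a SIGKILL, here is a rough sketch of a process that reacts to SIGHUP by shutting down in an orderly way. The subshell below is a hypothetical stand-in, not Elasticsearch; the log file and variable names are made up for the example:

```shell
# Stand-in "server": on SIGHUP it logs a message and exits cleanly.
log=$(mktemp)
(
  trap 'echo "stopping ..." >> "$log"; exit 0' HUP
  while :; do sleep 1; done
) &
server=$!

sleep 1                 # give the subshell time to install its trap
kill -HUP "$server"     # the signal a shell sends to its jobs when it exits

hup_status=0
wait "$server" || hup_status=$?

cat "$log"                            # the handler had a chance to log
echo "exit status: $hup_status"       # clean exit, unlike the 137 from SIGKILL
```

Because the signal can be handled, the process gets to run its shutdown sequence and write log lines first, which matches the stopping and closing messages in the logs.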

You can avoid this by using the disown shell command after backgrounding the process. But it would be better to integrate with systemd, which you would have had "for free" had you used the rpm install instead of the .tar.gz. Anyway, try this:

% bin/elasticsearch >> ./stdout.txt 2>> stderr.txt &
[1] 1652489
% disown %1
% exit

(the "1" is the number in square brackets, technically the job number).

The JVM process(es), elasticsearch, will continue to run even if/when shell exits. You can also use nohup. Or start within screen sessions. Or a few other ways.
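
As a rough sketch of the nohup variant, the snippet below checks that a process started under nohup survives a SIGHUP. Here sleep stands in for bin/elasticsearch, and the variable names are illustrative only:

```shell
# nohup makes the command ignore SIGHUP; sleep stands in for bin/elasticsearch.
nohup sleep 30 > /dev/null 2>&1 &
node=$!

sleep 1                 # let nohup exec the command before signalling it
kill -HUP "$node"       # simulate the terminal going away
sleep 1

# kill -0 sends no signal; it only checks that the process still exists.
if kill -0 "$node" 2>/dev/null; then
  survived=yes
else
  survived=no
fi
echo "survived SIGHUP: $survived"

# Clean up the stand-in process.
kill "$node"
wait "$node" 2>/dev/null || true
```

For the real node, the equivalent would presumably be something like nohup bin/elasticsearch >> ./stdout.txt 2>> stderr.txt & in place of the disown step. The tar.gz distribution also documents a -d option (for example bin/elasticsearch -d -p pid) that runs the node as a daemon.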

If this is the cause of the recent crashes, it's not clear why they are happening only now, if you were starting Elasticsearch the exact same way months ago.

That notwithstanding, the Nov 21 crash was caused by the OOM killer.
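
To tell the two kinds of shutdown apart in future, it may help to check the kernel log for OOM kills around the time of each outage. A minimal sketch, using the "Killed process" line quoted earlier in this thread as sample input; the oom_summary helper is a made-up name for this example:

```shell
# Sample line copied from the kernel log quoted earlier in this thread.
line='Nov 21 11:29:59 x.x.x.x kernel: Out of memory: Killed process 119441 (java) total-vm:20097204kB, anon-rss:9454160kB, file-rss:0kB, shmem-rss:0kB, UID:1002 pgtables:33412kB oom_score_adj:0'

# Hypothetical helper: extract PID, process name and anonymous RSS of each OOM victim.
oom_summary() {
  sed -n 's/.*Killed process \([0-9][0-9]*\) (\([^)]*\)).*anon-rss:\([0-9][0-9]*\)kB.*/pid=\1 name=\2 anon_rss_kb=\3/p'
}

summary=$(printf '%s\n' "$line" | oom_summary)
echo "$summary"   # prints "pid=119441 name=java anon_rss_kb=9454160"
```

Against the real logs this could be run as grep -h 'Killed process' /var/log/messages* | oom_summary, assuming the kernel messages are written to those files as in the excerpt above. An outage with a matching line was an OOM kill; one with stopping and closing messages in the Elasticsearch log was a graceful shutdown.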