Elasticsearch data nodes kept crashing continuously

We are using Elasticsearch v7.10.2 deployed in a Kubernetes environment. The Elasticsearch cluster was running fine earlier. We then updated a certificate in our Kubernetes cluster, which should not have impacted Elasticsearch in any way. After this, all the Elasticsearch pods restarted; the ingest and master nodes came up fine, but the 5 data nodes are continuously restarting with exit code 137. We tried increasing RAM and heap to higher values, but it did not help.
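For context, a minimal sketch of how the last termination state of a data pod can be read via the Kubernetes Python client (the namespace and pod name below are placeholders); this is where exit code 137 and the termination reason show up:

# Sketch: read the last termination state of a data pod to confirm exit code 137.
# Namespace and pod name are placeholders for illustration only.
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when run inside the cluster
v1 = client.CoreV1Api()

pod = v1.read_namespaced_pod(name="elasticsearch-data-0", namespace="elastic")
for cs in pod.status.container_statuses or []:
    term = cs.last_state.terminated
    if term is not None:
        # exit_code 137 means the container received SIGKILL;
        # reason is typically "OOMKilled" or "Error".
        print(cs.name, term.exit_code, term.reason, term.finished_at)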

The following log line is seen before the restarts:
{"type":"log","host":"elasticsearch-data-0.","level":"INFO","time": "2023-06-02T08:56:08.887Z","logger":"o.e.e.NodeEnvironment","timezone":"UTC","marker":"[elasticsearch-data-0] ","log":{"message":"using [1] data paths, mounts [[/data (10.148.95.33:/ttsvmnas004_edennet_data/elasticsearch-data-0-pvc-d9a0cf2c-e0fb-402f-b3c6-62fe7db60b00)]], net usable_space [1.3tb], net total_space [1.8tb], types [nfs]"}}

This is the cluster health output:
{
"cluster_name" : "test",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 7,
"number_of_data_nodes" : 1,
"active_primary_shards" : 7,
"active_shards" : 7,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 956,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 0.726895119418484
}

We tried updating memory to a higher value, but it did not help. However, after 2 days, without any change, the cluster came back to a proper state: the data nodes stopped restarting, all shards got assigned, and it has been working fine since then.
Cluster health output after the cluster recovered:
{ "cluster_name" : "test", "status" : "green", "timed_out" : false, "number_of_nodes" : 11, "number_of_data_nodes" : 5, "active_primary_shards" : 480, "active_shards" : 960, "relocating_shards" : 0, "initializing_shards" : 0, "unassigned_shards" : 0, "delayed_unassigned_shards" : 0, "number_of_pending_tasks" : 0, "number_of_in_flight_fetch" : 0, "task_max_waiting_in_queue_millis" : 0, "active_shards_percent_as_number" : 100.0 }

Please help us understand what could have led to this behavior.
Thanks in advance.

Are you using NFS storage for the Elasticsearch data nodes? If so, this is generally not recommended and requires that the storage is mounted so it behaves like local storage as outlined in the docs.

The log entry you linked to does not show why the nodes crashed. Please provide the full Elasticsearch logs from the failed startup attempt.

Also note that you are using an old version that has been EOL for a long time. I would recommend upgrading to at least version 7.17.

NFS storage is used, but it is not a shared mount.
Below are the complete logs before the node restarts; no error log is seen before the restarts:

{"type":"log","host":"elasticsearch-data-0.tmo-shcd","level":"WARN","time": "2023-06-02T08:56:07.268Z","logger":"o.e.c.l.LogConfigurator","timezone":"UTC","marker":"[elasticsearch-data-0] ","log":{"message":"Some logging configurations have %marker but don't have %node_name. We will automatically add %node_name to the pattern to ease the migration for users who customize log4j2.properties but will stop this behavior in 7.0. You should manually replace %node_name with [%node_name]%marker in these locations:
/etc/elasticsearch/esconfig/log4j2.properties"}}
{"type":"log","host":"elasticsearch-data-0.tmo-shcd","level":"INFO","time": "2023-06-02T08:56:07.572Z","logger":"o.e.n.Node","timezone":"UTC","marker":"[elasticsearch-data-0] ","log":{"message":"version[7.10.2], pid[18], build[oss/rpm/be1945029bf6730639d5bef8b39d8de9aa5efca8/2022-03-04T11:04:07.696525Z], OS[Linux/4.18.0-305.34.2.el8_4.x86_64/amd64], JVM[Red Hat, Inc./OpenJDK 64-Bit Server VM/11.0.14.1/11.0.14.1+1-LTS]"}}
{"type":"log","host":"elasticsearch-data-0.tmo-shcd","level":"INFO","time": "2023-06-02T08:56:07.572Z","logger":"o.e.n.Node","timezone":"UTC","marker":"[elasticsearch-data-0] ","log":{"message":"JVM home [/usr/lib/jvm/java-11-openjdk-11.0.14.1.1-1.el7_9.x86_64], using bundled JDK [false]"}}
{"type":"log","host":"elasticsearch-data-0.tmo-shcd","level":"INFO","time": "2023-06-02T08:56:07.572Z","logger":"o.e.n.Node","timezone":"UTC","marker":"[elasticsearch-data-0] ","log":{"message":"JVM arguments [-Xshare:auto, -Des.networkaddress.cache.ttl=60, -Des.networkaddress.cache.negative.ttl=10, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -XX:-OmitStackTraceInFastThrow, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dio.netty.recycler.maxCapacityPerThread=0, -Dio.netty.allocator.numDirectArenas=0, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -Djava.locale.providers=SPI,COMPAT, -XX:+UseG1GC, -XX:G1ReservePercent=25, -XX:InitiatingHeapOccupancyPercent=30, -Djava.io.tmpdir=/tmp/elasticsearch-2703224334137262050, -XX:+HeapDumpOnOutOfMemoryError, -XX:HeapDumpPath=/tmp/elasticsearch/heapdump.hprof, -XX:ErrorFile=/tmp/elasticsearch/hs_err.log, -Xlog:gc*=warning:file=/tmp/elasticsearch/gc.log:utctime,pid,tags,level:filecount=2,filesize=2m, -Des.cgroups.hierarchy.override=/, -Xms10g, -Xmx10g, -XX:MaxDirectMemorySize=5368709120, -Des.path.home=/usr/share/elasticsearch, -Des.path.conf=/etc/elasticsearch/esconfig, -Des.distribution.flavor=oss, -Des.distribution.type=rpm, -Des.bundled_jdk=true]"}}
{"type":"log","host":"elasticsearch-data-0.tmo-shcd","level":"INFO","time": "2023-06-02T08:56:08.489Z","logger":"o.e.p.p.PrometheusExporterPlugin","timezone":"UTC","marker":"[elasticsearch-data-0] ","log":{"message":"starting Prometheus exporter plugin"}}
{"type":"log","host":"elasticsearch-data-0.tmo-shcd","level":"INFO","time": "2023-06-02T08:56:08.820Z","logger":"o.e.p.PluginsService","timezone":"UTC","marker":"[elasticsearch-data-0] ","log":{"message":"loaded module [aggs-matrix-stats]"}}
{"type":"log","host":"elasticsearch-data-0.tmo-shcd","level":"INFO","time": "2023-06-02T08:56:08.820Z","logger":"o.e.p.PluginsService","timezone":"UTC","marker":"[elasticsearch-data-0] ","log":{"message":"loaded module [analysis-common]"}}
{"type":"log","host":"elasticsearch-data-0.tmo-shcd","level":"INFO","time": "2023-06-02T08:56:08.820Z","logger":"o.e.p.PluginsService","timezone":"UTC","marker":"[elasticsearch-data-0] ","log":{"message":"loaded module [geo]"}}
{"type":"log","host":"elasticsearch-data-0.tmo-shcd","level":"INFO","time": "2023-06-02T08:56:08.820Z","logger":"o.e.p.PluginsService","timezone":"UTC","marker":"[elasticsearch-data-0] ","log":{"message":"loaded module [ingest-common]"}}
{"type":"log","host":"elasticsearch-data-0.tmo-shcd","level":"INFO","time": "2023-06-02T08:56:08.821Z","logger":"o.e.p.PluginsService","timezone":"UTC","marker":"[elasticsearch-data-0] ","log":{"message":"loaded module [ingest-geoip]"}}
{"type":"log","host":"elasticsearch-data-0.tmo-shcd","level":"INFO","time": "2023-06-02T08:56:08.821Z","logger":"o.e.p.PluginsService","timezone":"UTC","marker":"[elasticsearch-data-0] ","log":{"message":"loaded module [ingest-user-agent]"}}
{"type":"log","host":"elasticsearch-data-0.tmo-shcd","level":"INFO","time": "2023-06-02T08:56:08.821Z","logger":"o.e.p.PluginsService","timezone":"UTC","marker":"[elasticsearch-data-0] ","log":{"message":"loaded module [kibana]"}}
{"type":"log","host":"elasticsearch-data-0.tmo-shcd","level":"INFO","time": "2023-06-02T08:56:08.821Z","logger":"o.e.p.PluginsService","timezone":"UTC","marker":"[elasticsearch-data-0] ","log":{"message":"loaded module [lang-expression]"}}
{"type":"log","host":"elasticsearch-data-0.tmo-shcd","level":"INFO","time": "2023-06-02T08:56:08.821Z","logger":"o.e.p.PluginsService","timezone":"UTC","marker":"[elasticsearch-data-0] ","log":{"message":"loaded module [lang-mustache]"}}
{"type":"log","host":"elasticsearch-data-0.tmo-shcd","level":"INFO","time": "2023-06-02T08:56:08.821Z","logger":"o.e.p.PluginsService","timezone":"UTC","marker":"[elasticsearch-data-0] ","log":{"message":"loaded module [lang-painless]"}}
{"type":"log","host":"elasticsearch-data-0.tmo-shcd","level":"INFO","time": "2023-06-02T08:56:08.821Z","logger":"o.e.p.PluginsService","timezone":"UTC","marker":"[elasticsearch-data-0] ","log":{"message":"loaded module [mapper-extras]"}}
{"type":"log","host":"elasticsearch-data-0.tmo-shcd","level":"INFO","time": "2023-06-02T08:56:08.821Z","logger":"o.e.p.PluginsService","timezone":"UTC","marker":"[elasticsearch-data-0] ","log":{"message":"loaded module [parent-join]"}}
{"type":"log","host":"elasticsearch-data-0.tmo-shcd","level":"INFO","time": "2023-06-02T08:56:08.821Z","logger":"o.e.p.PluginsService","timezone":"UTC","marker":"[elasticsearch-data-0] ","log":{"message":"loaded module [percolator]"}}
{"type":"log","host":"elasticsearch-data-0.tmo-shcd","level":"INFO","time": "2023-06-02T08:56:08.821Z","logger":"o.e.p.PluginsService","timezone":"UTC","marker":"[elasticsearch-data-0] ","log":{"message":"loaded module [rank-eval]"}}
{"type":"log","host":"elasticsearch-data-0.tmo-shcd","level":"INFO","time": "2023-06-02T08:56:08.821Z","logger":"o.e.p.PluginsService","timezone":"UTC","marker":"[elasticsearch-data-0] ","log":{"message":"loaded module [reindex]"}}
{"type":"log","host":"elasticsearch-data-0.tmo-shcd","level":"INFO","time": "2023-06-02T08:56:08.821Z","logger":"o.e.p.PluginsService","timezone":"UTC","marker":"[elasticsearch-data-0] ","log":{"message":"loaded module [repository-url]"}}
{"type":"log","host":"elasticsearch-data-0.tmo-shcd","level":"INFO","time": "2023-06-02T08:56:08.822Z","logger":"o.e.p.PluginsService","timezone":"UTC","marker":"[elasticsearch-data-0] ","log":{"message":"loaded module [systemd]"}}
{"type":"log","host":"elasticsearch-data-0.tmo-shcd","level":"INFO","time": "2023-06-02T08:56:08.822Z","logger":"o.e.p.PluginsService","timezone":"UTC","marker":"[elasticsearch-data-0] ","log":{"message":"loaded module [transport-netty4]"}}
{"type":"log","host":"elasticsearch-data-0.tmo-shcd","level":"INFO","time": "2023-06-02T08:56:08.822Z","logger":"o.e.p.PluginsService","timezone":"UTC","marker":"[elasticsearch-data-0] ","log":{"message":"loaded plugin [ingest-attachment]"}}
{"type":"log","host":"elasticsearch-data-0.tmo-shcd","level":"INFO","time": "2023-06-02T08:56:08.822Z","logger":"o.e.p.PluginsService","timezone":"UTC","marker":"[elasticsearch-data-0] ","log":{"message":"loaded plugin [prometheus-exporter]"}}
{"type":"log","host":"elasticsearch-data-0.tmo-shcd","level":"INFO","time": "2023-06-02T08:56:08.887Z","logger":"o.e.e.NodeEnvironment","timezone":"UTC","marker":"[elasticsearch-data-0] ","log":{"message":"using [1] data paths, mounts [[/data (10.148.95.33:/ttsvmnas004_edennet_data/tmo-shcd-datadir-elasticsearch-data-0-pvc-d9a0cf2c-e0fb-402f-b3c6-62fe7db60b00)]], net usable_space [1.3tb], net total_space [1.8tb], types [nfs]"}}
{"type":"log","host":"elasticsearch-data-0.tmo-shcd","level":"INFO","time": "2023-06-02T08:56:08.887Z","logger":"o.e.e.NodeEnvironment","timezone":"UTC","marker":"[elasticsearch-data-0] ","log":{"message":"heap size [10gb], compressed ordinary object pointers [true]"}}

Please check this and help.

I do not see any error in the logs so there is not much to go on.

Yes, that is exactly what is unclear: why the data nodes are restarting after this log message when there is no error at all. What could be the root cause in such a scenario, and how can it be debugged?

Hi,

In the current scenario, there are no error logs in the application logs. Would heap dump details be helpful in identifying and debugging such scenarios? Is there any other way to debug situations like the one described above?

Pretty much the only way a node will shut down without logging anything is if it receives a SIGKILL signal. This could be sent by any sufficiently-privileged process running on the same machine, but often it's the kernel's OOM killer. You'll need to look for logs related to this (e.g. the OOM killer reports its activities in the kernel logs which you can view with dmesg).
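For example, a minimal sketch (assuming you can run commands on the node itself, e.g. from a privileged debug pod) that filters the kernel ring buffer for OOM killer entries:

# Sketch: scan the kernel ring buffer for OOM killer activity.
# Must run on the node itself or in a pod with access to the host's kernel log.
import re
import subprocess

out = subprocess.run(["dmesg", "-T"], capture_output=True, text=True, check=True).stdout
for line in out.splitlines():
    # Heuristic match on typical OOM killer messages, e.g. "Out of memory: Killed process ..."
    if re.search(r"out of memory|oom-kill|killed process", line, re.IGNORECASE):
        print(line)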

