Elasticsearch data nodes kept crashing continuously

We are using Elasticsearch v7.10.2 deployed in a Kubernetes environment. The Elasticsearch cluster was running fine earlier. We then updated a certificate in our Kubernetes cluster, which should not have impacted Elasticsearch in any way. After this, all the Elasticsearch pods restarted; the ingest and master nodes came up fine, but the 5 data nodes are continuously restarting with exit code 137. We tried increasing RAM and heap to higher values, but it did not help.
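For context, a minimal sketch of how the last termination state of a data pod can be read via the Kubernetes Python client (the namespace and pod name below are placeholders); this is where exit code 137 and the termination reason show up:

# Sketch: read the last termination state of a data pod to confirm exit code 137.
# Namespace and pod name are placeholders for illustration only.
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when run inside the cluster
v1 = client.CoreV1Api()

pod = v1.read_namespaced_pod(name="elasticsearch-data-0", namespace="elastic")
for cs in pod.status.container_statuses or []:
    term = cs.last_state.terminated
    if term is not None:
        # exit_code 137 means the container received SIGKILL;
        # reason is typically "OOMKilled" or "Error".
        print(cs.name, term.exit_code, term.reason, term.finished_at)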

The following log line is seen before the restarts:
{"type":"log","host":"elasticsearch-data-0.","level":"INFO","time": "2023-06-02T08:56:08.887Z","logger":"o.e.e.NodeEnvironment","timezone":"UTC","marker":"[elasticsearch-data-0] ","log":{"message":"using [1] data paths, mounts [[/data (10.148.95.33:/ttsvmnas004_edennet_data/elasticsearch-data-0-pvc-d9a0cf2c-e0fb-402f-b3c6-62fe7db60b00)]], net usable_space [1.3tb], net total_space [1.8tb], types [nfs]"}}

This is the cluster health output:
{
"cluster_name" : "test",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 7,
"number_of_data_nodes" : 1,
"active_primary_shards" : 7,
"active_shards" : 7,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 956,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 0.726895119418484
}

We tried updating memory to a higher value, but it did not help. However, after 2 days, without any change, the cluster came back to a proper state: the data nodes stopped restarting, all shards got assigned, and it has been working fine since then.
Cluster health output after the cluster recovered:
{ "cluster_name" : "test", "status" : "green", "timed_out" : false, "number_of_nodes" : 11, "number_of_data_nodes" : 5, "active_primary_shards" : 480, "active_shards" : 960, "relocating_shards" : 0, "initializing_shards" : 0, "unassigned_shards" : 0, "delayed_unassigned_shards" : 0, "number_of_pending_tasks" : 0, "number_of_in_flight_fetch" : 0, "task_max_waiting_in_queue_millis" : 0, "active_shards_percent_as_number" : 100.0 }

Please help us understand what could have led to this behavior.
Thanks in advance.

Are you using NFS storage for the Elasticsearch data nodes? If so, this is generally not recommended and requires that the storage is mounted so it behaves like local storage as outlined in the docs.

The log entry you linked to does not show why the nodes crashed. Please provide the full Elasticsearch logs from the failed startup attempt.

Also note that you are using an old version that has been EOL for a long time. I would recommend upgrading to at least version 7.17.

NFS storage is used, but it is not a shared mount.
Below are the complete logs before the node restarts; no error log is seen before the restarts:

{"type":"log","host":"elasticsearch-data-0.tmo-shcd","level":"WARN","time": "2023-06-02T08:56:07.268Z","logger":"o.e.c.l.LogConfigurator","timezone":"UTC","marker":"[elasticsearch-data-0] ","log":{"message":"Some logging configurations have %marker but don't have %node_name. We will automatically add %node_name to the pattern to ease the migration for users who customize log4j2.properties but will stop this behavior in 7.0. You should manually replace %node_name with [%node_name]%marker in these locations:
/etc/elasticsearch/esconfig/log4j2.properties"}}
{"type":"log","host":"elasticsearch-data-0.tmo-shcd","level":"INFO","time": "2023-06-02T08:56:07.572Z","logger":"o.e.n.Node","timezone":"UTC","marker":"[elasticsearch-data-0] ","log":{"message":"version[7.10.2], pid[18], build[oss/rpm/be1945029bf6730639d5bef8b39d8de9aa5efca8/2022-03-04T11:04:07.696525Z], OS[Linux/4.18.0-305.34.2.el8_4.x86_64/amd64], JVM[Red Hat, Inc./OpenJDK 64-Bit Server VM/11.0.14.1/11.0.14.1+1-LTS]"}}
{"type":"log","host":"elasticsearch-data-0.tmo-shcd","level":"INFO","time": "2023-06-02T08:56:07.572Z","logger":"o.e.n.Node","timezone":"UTC","marker":"[elasticsearch-data-0] ","log":{"message":"JVM home [/usr/lib/jvm/java-11-openjdk-11.0.14.1.1-1.el7_9.x86_64], using bundled JDK [false]"}}
{"type":"log","host":"elasticsearch-data-0.tmo-shcd","level":"INFO","time": "2023-06-02T08:56:07.572Z","logger":"o.e.n.Node","timezone":"UTC","marker":"[elasticsearch-data-0] ","log":{"message":"JVM arguments [-Xshare:auto, -Des.networkaddress.cache.ttl=60, -Des.networkaddress.cache.negative.ttl=10, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -XX:-OmitStackTraceInFastThrow, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dio.netty.recycler.maxCapacityPerThread=0, -Dio.netty.allocator.numDirectArenas=0, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -Djava.locale.providers=SPI,COMPAT, -XX:+UseG1GC, -XX:G1ReservePercent=25, -XX:InitiatingHeapOccupancyPercent=30, -Djava.io.tmpdir=/tmp/elasticsearch-2703224334137262050, -XX:+HeapDumpOnOutOfMemoryError, -XX:HeapDumpPath=/tmp/elasticsearch/heapdump.hprof, -XX:ErrorFile=/tmp/elasticsearch/hs_err.log, -Xlog:gc*=warning:file=/tmp/elasticsearch/gc.log:utctime,pid,tags,level:filecount=2,filesize=2m, -Des.cgroups.hierarchy.override=/, -Xms10g, -Xmx10g, -XX:MaxDirectMemorySize=5368709120, -Des.path.home=/usr/share/elasticsearch, -Des.path.conf=/etc/elasticsearch/esconfig, -Des.distribution.flavor=oss, -Des.distribution.type=rpm, -Des.bundled_jdk=true]"}}
{"type":"log","host":"elasticsearch-data-0.tmo-shcd","level":"INFO","time": "2023-06-02T08:56:08.489Z","logger":"o.e.p.p.PrometheusExporterPlugin","timezone":"UTC","marker":"[elasticsearch-data-0] ","log":{"message":"starting Prometheus exporter plugin"}}
{"type":"log","host":"elasticsearch-data-0.tmo-shcd","level":"INFO","time": "2023-06-02T08:56:08.820Z","logger":"o.e.p.PluginsService","timezone":"UTC","marker":"[elasticsearch-data-0] ","log":{"message":"loaded module [aggs-matrix-stats]"}}
{"type":"log","host":"elasticsearch-data-0.tmo-shcd","level":"INFO","time": "2023-06-02T08:56:08.820Z","logger":"o.e.p.PluginsService","timezone":"UTC","marker":"[elasticsearch-data-0] ","log":{"message":"loaded module [analysis-common]"}}
{"type":"log","host":"elasticsearch-data-0.tmo-shcd","level":"INFO","time": "2023-06-02T08:56:08.820Z","logger":"o.e.p.PluginsService","timezone":"UTC","marker":"[elasticsearch-data-0] ","log":{"message":"loaded module [geo]"}}
{"type":"log","host":"elasticsearch-data-0.tmo-shcd","level":"INFO","time": "2023-06-02T08:56:08.820Z","logger":"o.e.p.PluginsService","timezone":"UTC","marker":"[elasticsearch-data-0] ","log":{"message":"loaded module [ingest-common]"}}
{"type":"log","host":"elasticsearch-data-0.tmo-shcd","level":"INFO","time": "2023-06-02T08:56:08.821Z","logger":"o.e.p.PluginsService","timezone":"UTC","marker":"[elasticsearch-data-0] ","log":{"message":"loaded module [ingest-geoip]"}}
{"type":"log","host":"elasticsearch-data-0.tmo-shcd","level":"INFO","time": "2023-06-02T08:56:08.821Z","logger":"o.e.p.PluginsService","timezone":"UTC","marker":"[elasticsearch-data-0] ","log":{"message":"loaded module [ingest-user-agent]"}}
{"type":"log","host":"elasticsearch-data-0.tmo-shcd","level":"INFO","time": "2023-06-02T08:56:08.821Z","logger":"o.e.p.PluginsService","timezone":"UTC","marker":"[elasticsearch-data-0] ","log":{"message":"loaded module [kibana]"}}
{"type":"log","host":"elasticsearch-data-0.tmo-shcd","level":"INFO","time": "2023-06-02T08:56:08.821Z","logger":"o.e.p.PluginsService","timezone":"UTC","marker":"[elasticsearch-data-0] ","log":{"message":"loaded module [lang-expression]"}}
{"type":"log","host":"elasticsearch-data-0.tmo-shcd","level":"INFO","time": "2023-06-02T08:56:08.821Z","logger":"o.e.p.PluginsService","timezone":"UTC","marker":"[elasticsearch-data-0] ","log":{"message":"loaded module [lang-mustache]"}}
{"type":"log","host":"elasticsearch-data-0.tmo-shcd","level":"INFO","time": "2023-06-02T08:56:08.821Z","logger":"o.e.p.PluginsService","timezone":"UTC","marker":"[elasticsearch-data-0] ","log":{"message":"loaded module [lang-painless]"}}
{"type":"log","host":"elasticsearch-data-0.tmo-shcd","level":"INFO","time": "2023-06-02T08:56:08.821Z","logger":"o.e.p.PluginsService","timezone":"UTC","marker":"[elasticsearch-data-0] ","log":{"message":"loaded module [mapper-extras]"}}
{"type":"log","host":"elasticsearch-data-0.tmo-shcd","level":"INFO","time": "2023-06-02T08:56:08.821Z","logger":"o.e.p.PluginsService","timezone":"UTC","marker":"[elasticsearch-data-0] ","log":{"message":"loaded module [parent-join]"}}
{"type":"log","host":"elasticsearch-data-0.tmo-shcd","level":"INFO","time": "2023-06-02T08:56:08.821Z","logger":"o.e.p.PluginsService","timezone":"UTC","marker":"[elasticsearch-data-0] ","log":{"message":"loaded module [percolator]"}}
{"type":"log","host":"elasticsearch-data-0.tmo-shcd","level":"INFO","time": "2023-06-02T08:56:08.821Z","logger":"o.e.p.PluginsService","timezone":"UTC","marker":"[elasticsearch-data-0] ","log":{"message":"loaded module [rank-eval]"}}
{"type":"log","host":"elasticsearch-data-0.tmo-shcd","level":"INFO","time": "2023-06-02T08:56:08.821Z","logger":"o.e.p.PluginsService","timezone":"UTC","marker":"[elasticsearch-data-0] ","log":{"message":"loaded module [reindex]"}}
{"type":"log","host":"elasticsearch-data-0.tmo-shcd","level":"INFO","time": "2023-06-02T08:56:08.821Z","logger":"o.e.p.PluginsService","timezone":"UTC","marker":"[elasticsearch-data-0] ","log":{"message":"loaded module [repository-url]"}}
{"type":"log","host":"elasticsearch-data-0.tmo-shcd","level":"INFO","time": "2023-06-02T08:56:08.822Z","logger":"o.e.p.PluginsService","timezone":"UTC","marker":"[elasticsearch-data-0] ","log":{"message":"loaded module [systemd]"}}
{"type":"log","host":"elasticsearch-data-0.tmo-shcd","level":"INFO","time": "2023-06-02T08:56:08.822Z","logger":"o.e.p.PluginsService","timezone":"UTC","marker":"[elasticsearch-data-0] ","log":{"message":"loaded module [transport-netty4]"}}
{"type":"log","host":"elasticsearch-data-0.tmo-shcd","level":"INFO","time": "2023-06-02T08:56:08.822Z","logger":"o.e.p.PluginsService","timezone":"UTC","marker":"[elasticsearch-data-0] ","log":{"message":"loaded plugin [ingest-attachment]"}}
{"type":"log","host":"elasticsearch-data-0.tmo-shcd","level":"INFO","time": "2023-06-02T08:56:08.822Z","logger":"o.e.p.PluginsService","timezone":"UTC","marker":"[elasticsearch-data-0] ","log":{"message":"loaded plugin [prometheus-exporter]"}}
{"type":"log","host":"elasticsearch-data-0.tmo-shcd","level":"INFO","time": "2023-06-02T08:56:08.887Z","logger":"o.e.e.NodeEnvironment","timezone":"UTC","marker":"[elasticsearch-data-0] ","log":{"message":"using [1] data paths, mounts [[/data (10.148.95.33:/ttsvmnas004_edennet_data/tmo-shcd-datadir-elasticsearch-data-0-pvc-d9a0cf2c-e0fb-402f-b3c6-62fe7db60b00)]], net usable_space [1.3tb], net total_space [1.8tb], types [nfs]"}}
{"type":"log","host":"elasticsearch-data-0.tmo-shcd","level":"INFO","time": "2023-06-02T08:56:08.887Z","logger":"o.e.e.NodeEnvironment","timezone":"UTC","marker":"[elasticsearch-data-0] ","log":{"message":"heap size [10gb], compressed ordinary object pointers [true]"}}

Please check this and help.

I do not see any error in the logs so there is not much to go on.

Yes, that is exactly what is unclear: why the data nodes are restarting after this log message when there is no error at all. What could be the root cause in such a scenario, and how can it be debugged?

Hi,

In the current scenario, there are no error logs in the application logs. Would heap dump details be helpful in identifying and debugging such scenarios? Is there any other way to debug situations like the one described above?

Pretty much the only way a node will shut down without logging anything is if it receives a SIGKILL signal. This could be sent by any sufficiently-privileged process running on the same machine, but often it's the kernel's OOM killer. You'll need to look for logs related to this (e.g. the OOM killer reports its activities in the kernel logs which you can view with dmesg).
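For example, a minimal sketch (assuming you can run commands on the node itself, e.g. from a privileged debug pod) that filters the kernel ring buffer for OOM killer entries:

# Sketch: scan the kernel ring buffer for OOM killer activity.
# Must run on the node itself or in a pod with access to the host's kernel log.
import re
import subprocess

out = subprocess.run(["dmesg", "-T"], capture_output=True, text=True, check=True).stdout
for line in out.splitlines():
    # Heuristic match on typical OOM killer messages, e.g. "Out of memory: Killed process ..."
    if re.search(r"out of memory|oom-kill|killed process", line, re.IGNORECASE):
        print(line)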

