Third node on cluster shuts down periodically

I've been running a 3-node Elasticsearch 6.8.0 cluster for over a year now. Since it went live it has effectively run with only 2 nodes, because every time I join the third node, that node eventually stops working.

Right now node-1 and node-3 are running. When I start node-2 the cluster begins reallocating shards, but at some point the node stops working. I observed the same behavior several times before, except with node-3 as the joining node. Since node-1 is exposed to the services that interact with ES I can't test it the same way, but I suspect it would happen there too.

Here are node details and config files:

_cat/nodes

10.92.112.138 55 99 3 0.08 0.11 0.13 mdi * node-1
10.92.112.140 71 99 1 0.09 0.06 0.06 mdi - node-3
elasticsearch.yml (node-2):

cluster.name: wilab-prod
node.name: node-2
path.data: /elasticsearch_1/data
path.logs: /elasticsearch_1/log
network.host: elasticwilab02.client.domain
discovery.zen.ping.unicast.hosts: ["elasticwilab01.client.domain", "elasticwilab02.client.domain", "elasticwilab03.client.domain"]
discovery.zen.minimum_master_nodes: 2

xpack.security.enabled: true
xpack.security.http.ssl.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.verification_mode: certificate
xpack.security.transport.ssl.keystore.path: /opt/elasticsearch/elasticsearch-6.8.0/config/elasticsearch-certificates.p12
xpack.security.transport.ssl.truststore.path: /opt/elasticsearch/elasticsearch-6.8.0/config/elasticsearch-certificates.p12
xpack.security.http.ssl.keystore.path: /opt/elasticsearch/elasticsearch-6.8.0/config/elasticsearch-certificates.p12
xpack.security.http.ssl.truststore.path: /opt/elasticsearch/elasticsearch-6.8.0/config/elasticsearch-certificates.p12
xpack.security.http.ssl.client_authentication: optional

The config file is essentially identical on all three nodes; only node-specific parameters such as the node name, host, and paths differ.

Does anyone know what might be causing the third node to shut down after joining the cluster?

Hello @javiroberts

I suggest checking:

  • Whether there are any log lines when the Elasticsearch node stops, to understand if there was an Out of Memory error or whether it was stopped gracefully
    • Whether a heap dump was generated, in case of an OoM
  • Whether dmesg -T shows any error related to the Linux kernel OOM killer terminating the instance because the host is short on memory (see the example commands after this list).
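A few commands that can help with those checks. The log path comes from the config posted above; the log file name (derived from the cluster name) and the heap dump location (set by -XX:HeapDumpPath in jvm.options) are assumptions and may differ on your install:

# Look for Java-level OOM or fatal errors in the Elasticsearch log
grep -iE "OutOfMemoryError|fatal" /elasticsearch_1/log/wilab-prod.log

# Check whether the kernel OOM killer terminated the process
dmesg -T | grep -iE "out of memory|killed process"

# Look for heap dumps (only written if -XX:+HeapDumpOnOutOfMemoryError is set in jvm.options;
# the directory below is a guess based on path.data)
ls -lh /elasticsearch_1/data/*.hprof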

Hello @Luca_Belluccini, thanks for the answer.

Here's my log output:

[2020-04-29T00:32:30,845][INFO ][o.e.n.Node               ] [node-2] started
[2020-04-29T02:33:09,052][INFO ][o.e.x.m.p.NativeController] [node-2] Native controller process has stopped - no new native processes can be started
[2020-04-29T02:33:09,055][INFO ][o.e.n.Node               ] [node-2] stopping ...
[2020-04-29T02:33:09,074][INFO ][o.e.x.w.WatcherService   ] [node-2] stopping watch service, reason [shutdown initiated]
[2020-04-29T02:33:09,216][WARN ][o.e.t.OutboundHandler    ] [node-2] send message failed [channel: Netty4TcpChannel{localAddress=0.0.0.0/0.0.0.0:9300, remoteAddress=/10.92.112.138:50038}]
java.nio.channels.ClosedChannelException: null
	at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) ~[?:?]
[2020-04-29T02:33:09,218][WARN ][o.e.t.OutboundHandler    ] [node-2] send message failed [channel: Netty4TcpChannel{localAddress=0.0.0.0/0.0.0.0:9300, remoteAddress=/10.92.112.138:50026}]
java.nio.channels.ClosedChannelException: null
	at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) ~[?:?]
[2020-04-29T02:33:09,796][INFO ][o.e.n.Node               ] [node-2] stopped
[2020-04-29T02:33:09,797][INFO ][o.e.n.Node               ] [node-2] closing ...
[2020-04-29T02:33:10,588][INFO ][o.e.n.Node               ] [node-2] closed

There is no relevant output from dmesg -T; the latest entry is from Mon Apr 6 09:33:12 2020.
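Since the log above shows a clean "stopping ... stopped ... closed" sequence rather than a crash, it can also be worth confirming whether something outside Elasticsearch initiated the shutdown. For example, assuming a systemd-managed service named elasticsearch (adjust the unit name, or skip this if the node is started manually from the tarball):

# Show service events around the time of the stop, including who/what stopped the unit
journalctl -u elasticsearch --since "2020-04-29 02:00" --until "2020-04-29 03:00"

# Check for host-level shutdowns or reboots around the same time
last -x | head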

Are those all the logs?

Would you please edit the log4j2.properties file in the Elasticsearch configuration directory and enable debug logging? E.g. rootLogger.level = debug.
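For reference, a minimal sketch of that change, assuming the stock log4j2.properties shipped with 6.8 and the config directory from the elasticsearch.yml posted above (remember to revert the level once the issue is captured):

# In /opt/elasticsearch/elasticsearch-6.8.0/config/log4j2.properties,
# change the root logger line from:
#   rootLogger.level = info
# to:
#   rootLogger.level = debug
# then verify the change took effect:
grep "^rootLogger.level" /opt/elasticsearch/elasticsearch-6.8.0/config/log4j2.properties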

Try restarting it again and upload the log file to a GitHub gist or similar.

Thanks @Luca_Belluccini.

I restarted with rootLogger.level = debug, but this time the node has been up and stable for many hours now, and all the shards have been reallocated.

Since the only config parameter I changed was the logger level, it's pretty weird that this time everything went OK. Do you know what could have happened?

I would wait and monitor the situation, keeping the logs at debug level.
Watch the disk space on the mount holding the logs, as debug logging is much more verbose.
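For example (path.logs taken from the config posted earlier):

# Check free space on the log mount and how much the debug logs are consuming
df -h /elasticsearch_1/log
du -sh /elasticsearch_1/log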
