Third node on cluster shuts down periodically

I've been running a 3-node Elasticsearch 6.8.0 cluster for over a year now. Since it went live it has effectively run with only 2 nodes, because every time I join the third node, that node eventually stops working.

Right now node-1 and node-3 are running. When I start node-2 the cluster begins reallocating shards, but at some point the node stops working. I observed the same behavior several times before, except with node-3 as the joining node. Since node-1 is exposed to the services that interact with ES I can't test it the same way, but I suspect it would happen there too.

Here are node details and config files:

_cat/nodes

10.92.112.138 55 99 3 0.08 0.11 0.13 mdi * node-1
10.92.112.140 71 99 1 0.09 0.06 0.06 mdi - node-3
elasticsearch.yml (node-2):

cluster.name: wilab-prod
node.name: node-2
path.data: /elasticsearch_1/data
path.logs: /elasticsearch_1/log
network.host: elasticwilab02.client.domain
discovery.zen.ping.unicast.hosts: ["elasticwilab01.client.domain", "elasticwilab02.client.domain", "elasticwilab03.client.domain"]
discovery.zen.minimum_master_nodes: 2

xpack.security.enabled: true
xpack.security.http.ssl.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.verification_mode: certificate
xpack.security.transport.ssl.keystore.path: /opt/elasticsearch/elasticsearch-6.8.0/config/elasticsearch-certificates.p12
xpack.security.transport.ssl.truststore.path: /opt/elasticsearch/elasticsearch-6.8.0/config/elasticsearch-certificates.p12
xpack.security.http.ssl.keystore.path: /opt/elasticsearch/elasticsearch-6.8.0/config/elasticsearch-certificates.p12
xpack.security.http.ssl.truststore.path: /opt/elasticsearch/elasticsearch-6.8.0/config/elasticsearch-certificates.p12
xpack.security.http.ssl.client_authentication: optional

The config file is essentially identical on all three nodes; only node-specific parameters such as the node name, host, and paths differ.

Does anyone know what might be causing the third node to shut down after joining the cluster?

Hello @javiroberts

I suggest checking:

  • Whether there are any log lines when the Elasticsearch node stops, to understand if there was an Out of Memory error or whether it was stopped gracefully
    • Whether a heap dump was generated, in case of an OoM
  • Whether dmesg -T shows any error related to the Linux kernel OOM killer terminating the instance because the host is short on memory (see the example commands after this list).
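A few commands that can help with those checks. The log path comes from the config posted above; the log file name (derived from the cluster name) and the heap dump location (set by -XX:HeapDumpPath in jvm.options) are assumptions and may differ on your install:

# Look for Java-level OOM or fatal errors in the Elasticsearch log
grep -iE "OutOfMemoryError|fatal" /elasticsearch_1/log/wilab-prod.log

# Check whether the kernel OOM killer terminated the process
dmesg -T | grep -iE "out of memory|killed process"

# Look for heap dumps (only written if -XX:+HeapDumpOnOutOfMemoryError is set in jvm.options;
# the directory below is a guess based on path.data)
ls -lh /elasticsearch_1/data/*.hprof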

Hello @Luca_Belluccini, thanks for the answer.

Here's my log output:

[2020-04-29T00:32:30,845][INFO ][o.e.n.Node               ] [node-2] started
[2020-04-29T02:33:09,052][INFO ][o.e.x.m.p.NativeController] [node-2] Native controller process has stopped - no new native processes can be started
[2020-04-29T02:33:09,055][INFO ][o.e.n.Node               ] [node-2] stopping ...
[2020-04-29T02:33:09,074][INFO ][o.e.x.w.WatcherService   ] [node-2] stopping watch service, reason [shutdown initiated]
[2020-04-29T02:33:09,216][WARN ][o.e.t.OutboundHandler    ] [node-2] send message failed [channel: Netty4TcpChannel{localAddress=0.0.0.0/0.0.0.0:9300, remoteAddress=/10.92.112.138:50038}]
java.nio.channels.ClosedChannelException: null
	at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) ~[?:?]
[2020-04-29T02:33:09,218][WARN ][o.e.t.OutboundHandler    ] [node-2] send message failed [channel: Netty4TcpChannel{localAddress=0.0.0.0/0.0.0.0:9300, remoteAddress=/10.92.112.138:50026}]
java.nio.channels.ClosedChannelException: null
	at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) ~[?:?]
[2020-04-29T02:33:09,796][INFO ][o.e.n.Node               ] [node-2] stopped
[2020-04-29T02:33:09,797][INFO ][o.e.n.Node               ] [node-2] closing ...
[2020-04-29T02:33:10,588][INFO ][o.e.n.Node               ] [node-2] closed

There is no relevant output from dmesg -T; the latest entry is from Mon Apr 6 09:33:12 2020.
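Since the log above shows a clean "stopping ... stopped ... closed" sequence rather than a crash, it can also be worth confirming whether something outside Elasticsearch initiated the shutdown. For example, assuming a systemd-managed service named elasticsearch (adjust the unit name, or skip this if the node is started manually from the tarball):

# Show service events around the time of the stop, including who/what stopped the unit
journalctl -u elasticsearch --since "2020-04-29 02:00" --until "2020-04-29 03:00"

# Check for host-level shutdowns or reboots around the same time
last -x | head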

Are those all the logs?

Would you please edit the log4j2.properties file in the Elasticsearch configuration directory and enable debug logging? E.g. rootLogger.level = debug.
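For reference, a minimal sketch of that change, assuming the stock log4j2.properties shipped with 6.8 and the config directory from the elasticsearch.yml posted above (remember to revert the level once the issue is captured):

# In /opt/elasticsearch/elasticsearch-6.8.0/config/log4j2.properties,
# change the root logger line from:
#   rootLogger.level = info
# to:
#   rootLogger.level = debug
# then verify the change took effect:
grep "^rootLogger.level" /opt/elasticsearch/elasticsearch-6.8.0/config/log4j2.properties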

Try restarting it again and upload the log file to a GitHub gist or similar.

Thanks @Luca_Belluccini.

I restarted with rootLogger.level = debug, but this time the node has been up and stable for many hours now, and all the shards have been reallocated.

Since the only config parameter I changed was the logger level, it's pretty weird that this time everything went OK. Do you know what could have happened?

I would wait and monitor the situation, keeping the logs at debug level.
Watch the disk space on the mount holding the logs, as debug logging is much more verbose.
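For example (path.logs taken from the config posted earlier):

# Check free space on the log mount and how much the debug logs are consuming
df -h /elasticsearch_1/log
du -sh /elasticsearch_1/log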
