I've been running a 3-node Elasticsearch 6.8.0 cluster for over a year now. Since it went live it has only ever run with 2 nodes, because every time the third node joins, that node stops working.
In the current case, node-1 and node-3 are running. When I start node-2, the cluster starts reallocating shards, but at some point node-2 stops working. I have seen the same behavior repeatedly before, with node-3 as the node joining the cluster. Since node-1 is exposed to the services that interact with ES, I can't try the same with it, but I suspect it would happen there too.
Here are the node details and config files.
Output of _cat/nodes:
10.92.112.138 55 99 3 0.08 0.11 0.13 mdi * node-1
10.92.112.140 71 99 1 0.09 0.06 0.06 mdi - node-3
Config file for node-2:
cluster.name: wilab-prod
node.name: node-2
path.data: /elasticsearch_1/data
path.logs: /elasticsearch_1/log
network.host: elasticwilab02.client.domain
discovery.zen.ping.unicast.hosts: ["elasticwilab01.client.domain", "elasticwilab02.client.domain", "elasticwilab03.client.domain"]
discovery.zen.minimum_master_nodes: 2
xpack.security.enabled: true
xpack.security.http.ssl.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.verification_mode: certificate
xpack.security.transport.ssl.keystore.path: /opt/elasticsearch/elasticsearch-6.8.0/config/elasticsearch-certificates.p12
xpack.security.transport.ssl.truststore.path: /opt/elasticsearch/elasticsearch-6.8.0/config/elasticsearch-certificates.p12
xpack.security.http.ssl.keystore.path: /opt/elasticsearch/elasticsearch-6.8.0/config/elasticsearch-certificates.p12
xpack.security.http.ssl.truststore.path: /opt/elasticsearch/elasticsearch-6.8.0/config/elasticsearch-certificates.p12
xpack.security.http.ssl.client_authentication: optional
The config file is the same on all three nodes, except for node-specific parameters such as node name, host and paths.
Does anyone know what might cause the third node that joins the cluster to shut down?
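As a sanity check on discovery.zen.minimum_master_nodes: the usual guidance for Zen discovery in 6.x is a quorum of master-eligible nodes, i.e. (master_eligible / 2) + 1 with integer division. Below is a minimal Python sketch of that arithmetic, assuming all three nodes are master-eligible (the two visible in _cat/nodes report the roles mdi; node-2 is assumed to use the defaults). The function name is made up for illustration.

```python
def minimum_master_nodes(master_eligible: int) -> int:
    """Return the quorum size for a given number of master-eligible nodes."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    # Integer division, then add one: a strict majority of the nodes.
    return master_eligible // 2 + 1


# Assumption: node-1, node-2 and node-3 are all master-eligible.
print(minimum_master_nodes(3))  # 2
```

With three master-eligible nodes this gives 2, which matches the value in the config above, so the setting itself does not look like the problem.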
Hello @Luca_Belluccini, thanks for the answer.
Here's my log output:
[2020-04-29T00:32:30,845][INFO ][o.e.n.Node ] [node-2] started
[2020-04-29T02:33:09,052][INFO ][o.e.x.m.p.NativeController] [node-2] Native controller process has stopped - no new native processes can be started
[2020-04-29T02:33:09,055][INFO ][o.e.n.Node ] [node-2] stopping ...
[2020-04-29T02:33:09,074][INFO ][o.e.x.w.WatcherService ] [node-2] stopping watch service, reason [shutdown initiated]
[2020-04-29T02:33:09,216][WARN ][o.e.t.OutboundHandler ] [node-2] send message failed [channel: Netty4TcpChannel{localAddress=0.0.0.0/0.0.0.0:9300, remoteAddress=/10.92.112.138:50038}]
java.nio.channels.ClosedChannelException: null
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) ~[?:?]
[2020-04-29T02:33:09,218][WARN ][o.e.t.OutboundHandler ] [node-2] send message failed [channel: Netty4TcpChannel{localAddress=0.0.0.0/0.0.0.0:9300, remoteAddress=/10.92.112.138:50026}]
java.nio.channels.ClosedChannelException: null
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) ~[?:?]
[2020-04-29T02:33:09,796][INFO ][o.e.n.Node ] [node-2] stopped
[2020-04-29T02:33:09,797][INFO ][o.e.n.Node ] [node-2] closing ...
[2020-04-29T02:33:10,588][INFO ][o.e.n.Node ] [node-2] closed
There is no relevant output from dmesg -T; the latest entry is from Mon Apr 6 09:33:12 2020.
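To see how long the node stayed up before it shut down, the timestamps of the "started" and "stopping ..." lines can be compared. This is a rough Python sketch, assuming the log line layout shown above ([timestamp][level][logger] [node] message); the regular expression and function names are my own and may need adjusting for other log patterns.

```python
import re
from datetime import datetime

# Assumed layout: [timestamp][LEVEL ][logger ] [node] message
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\[(?P<level>\w+)\s*\]\[(?P<logger>[^\]]+?)\s*\]"
    r"\s*\[(?P<node>[^\]]+)\]\s*(?P<msg>.*)$"
)


def parse_line(line):
    """Split one log line into its fields, or return None for stack-trace lines."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    ts = datetime.strptime(m.group("ts"), "%Y-%m-%dT%H:%M:%S,%f")
    return ts, m.group("level"), m.group("logger"), m.group("node"), m.group("msg")


def uptime(lines):
    """Time between the node's 'started' and the following 'stopping ...' line."""
    started = stopping = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        ts, _level, logger, _node, msg = parsed
        if logger != "o.e.n.Node":
            continue
        if msg == "started":
            started = ts
        elif msg.startswith("stopping") and started is not None:
            stopping = ts
    if started is None or stopping is None:
        return None
    return stopping - started


sample = [
    "[2020-04-29T00:32:30,845][INFO ][o.e.n.Node ] [node-2] started",
    "[2020-04-29T02:33:09,055][INFO ][o.e.n.Node ] [node-2] stopping ...",
]
print(uptime(sample))
```

For the two lines above this comes out at a bit over two hours between start and shutdown.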
Are those all the logs?
Would you please edit the log4j2.properties file in the Elasticsearch configuration directory and enable debug logging? E.g. rootLogger.level = debug.
Then restart the node and upload the log file to a GitHub gist or similar.
Thanks @Luca_Belluccini.
I restarted the node with rootLogger.level = debug, but this time it has been up and stable for many hours and all the shards have been reallocated.
Since the only config parameter I changed was the logger level, it's pretty weird that this time everything went OK. Do you know what could have happened?
I would wait and monitor the situation, keeping the logs at debug level.
Keep an eye on the disk space of the logs mount, as debug logging is much more verbose.
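One way to keep an eye on the logs mount is a small script run periodically (from cron, for example). This is only a sketch: the default path is the path.logs value from the config posted above, while the 10% threshold and the function names are arbitrary assumptions.

```python
import shutil
import sys


def free_fraction(path):
    """Fraction (0.0 to 1.0) of free space on the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def log_mount_is_low(path, min_free=0.10):
    """True when free space falls below min_free (assumed threshold: 10%)."""
    return free_fraction(path) < min_free


if __name__ == "__main__":
    # Default taken from path.logs in the posted config; override via argv.
    log_path = sys.argv[1] if len(sys.argv) > 1 else "/elasticsearch_1/log"
    try:
        if log_mount_is_low(log_path):
            print(f"WARNING: less than 10% free on {log_path}")
        else:
            print(f"{log_path}: {free_fraction(log_path):.0%} free")
    except OSError as exc:
        print(f"could not check {log_path}: {exc}")
```

Run it with the logs directory as the only argument, or with no argument to use the default path.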