Master eligible node won't rejoin cluster after reboot

adrian0 · May 25, 2020, 6:23pm

We have a multi-node cluster, 3 nodes of which are dedicated master nodes. I wanted to do a normal rolling package (apt) update/upgrade and reboot, but when I restarted the first master-eligible node, it wouldn't rejoin the cluster after the reboot.

ES Version 7.4.2
esm01 - elected master node
esm02 - master-eligible node
esm03 - master-eligible node - rebooted and won't join the existing cluster.

Config esm03:

cluster.name: cluster01
node.name: esm03
node.attr.rack: virtual
node.master: true
node.data: false

path.data: /es/data
path.logs: /es/logs

http.port: 9200
http.bind_host: X.X.1.42

transport.tcp.port: 9300
transport.bind_host: X.X.2.42
transport.publish_host: X.X.2.42

discovery.seed_hosts: ["esm01", "esm02", "esm03"]
gateway.recover_after_nodes: 5
action.destructive_requires_name: false

transport.tcp.connect_timeout: 120s

In the logs from esm03, all I see is this entry over and over again:

[2020-05-25T17:57:10,637][WARN ][o.e.c.c.ClusterFormationFailureHelper]
[esm03] master not discovered or elected yet, an election requires at least 2 
nodes with ids from [BXQ6ct83RDuDqQqJQ-3CIw, L2jT3WjSRqmjInBZm7xgyA, 
u5qnw0QZS3WrDFNMKcLQkQ], have discovered [{esm03}
{BXQ6ct83RDuDqQqJQ-3CIw}{fi8SPj2xSpOjkB-dJIHalQ}{X.X.2.42}{X.X.2.42:9300}
{ilm}{ml.machine_memory=8371269632, rack=virtual, xpack.installed=true, 
ml.max_open_jobs=20}] which is not a quorum; discovery will continue using 
[X.X.1.40:9300, X.X.1.41:9300, 127.0.1.1:9300] from hosts providers and [{esm03}
{BXQ6ct83RDuDqQqJQ-3CIw}{fi8SPj2xSpOjkB-dJIHalQ}{X.X.2.42}{X.X.2.42:9300}
{ilm}{ml.machine_memory=8371269632, rack=virtual, xpack.installed=true, 
ml.max_open_jobs=20}] from last-known cluster state; node term 9, last-
accepted version 463803 in term 9

The IDs correspond to the existing master nodes:

L2jT3WjSRqmjInBZm7xgyA - esm01
u5qnw0QZS3WrDFNMKcLQkQ - esm02
BXQ6ct83RDuDqQqJQ-3CIw - esm03

I can ping the other ES nodes, and UFW is open for connections on both port 9200 and 9300. However, I do not see any attempts to join the cluster in the logs on the other master nodes, esm01 and esm02.
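A successful ping only shows that ICMP gets through; it does not show that TCP port 9300 is reachable at the address the host name resolves to. A minimal sketch of a TCP-level check follows. The helper name `can_connect` is hypothetical, and it is meant to be run from esm03 against the other masters:

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host name and attempts a TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

For example, `can_connect("esm01", 9300)` tests the transport port using whatever address `esm01` resolves to on this machine, which is the same name Elasticsearch is given in `discovery.seed_hosts`.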

Where do I go from here? Could it be an external firewall blocking traffic?

Sincerely,
Adrian

DavidTurner · May 25, 2020, 6:43pm

Looks like a connectivity issue indeed: this node is not even discovering the other masters.

Are you sure that these addresses are right? This node is bound to X.X.2.42 not X.X.1.42, so maybe on a different network.
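The log above shows discovery trying X.X.1.40, X.X.1.41 and 127.0.1.1, while the transport layer is bound to X.X.2.42, so it may help to check what each seed host name actually resolves to. Below is a rough sketch, assuming the transport network can be expressed as a single subnet; the function names and the subnet in the usage note are hypothetical, since the real addresses are redacted in this thread:

```python
import ipaddress
import socket
from typing import Callable, Dict, Iterable, List


def resolve_ipv4(host: str) -> List[str]:
    """Resolve a host name to its IPv4 addresses using the system resolver."""
    infos = socket.getaddrinfo(host, None, socket.AF_INET, socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def hosts_outside_subnet(
    hosts: Iterable[str],
    subnet: str,
    resolver: Callable[[str], List[str]] = resolve_ipv4,
) -> Dict[str, List[str]]:
    """Map each host to the resolved addresses that fall outside `subnet`."""
    network = ipaddress.ip_network(subnet)
    outside: Dict[str, List[str]] = {}
    for host in hosts:
        bad = [a for a in resolver(host) if ipaddress.ip_address(a) not in network]
        if bad:
            outside[host] = bad
    return outside
```

Calling `hosts_outside_subnet(["esm01", "esm02", "esm03"], "10.0.2.0/24")` (with a made-up subnet standing in for the transport network) would list any seed host that resolves to the HTTP subnet or to a loopback address such as 127.0.1.1. An empty result means every name resolves inside the transport subnet.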

adrian0 · May 25, 2020, 6:48pm

I'm not sure the addresses are correct. We have separated the HTTP traffic on port 9200 and the management traffic on port 9300 onto two different subnets.

Should I try IP addresses instead of hostnames in discovery.seed_hosts? Or is there a better, more dynamic way of doing it (in case the IP changes, for instance)?

adrian0 · May 25, 2020, 7:02pm

You were right. I tried using IP addresses instead of the hostnames, and it worked. Hm, now I have to figure out how to make the hostnames resolve to the correct interface on the server...

Thanks for the help!

DavidTurner · May 25, 2020, 7:08pm

I'd say to stick with hostnames and adjust your DNS system if they're not resolving as you need them to.

The important address is the transport publish address. Every node needs to be able to access every other node's transport publish address. You can obtain this from the nodes' logs at startup:

[2020-05-25T20:03:40,148][INFO ][o.e.t.TransportService   ] [node-0] publish_address {127.0.0.1:9300}, bound_addresses {127.0.0.1:9300}

Alternatively you can request it from a running node with GET /_nodes/_local/transport?filter_path=nodes.*.name,nodes.*.transport.publish_address (i.e. curl "http://$HTTP_ADDRESS/_nodes/_local/transport?filter_path=nodes.*.name,nodes.*.transport.publish_address").
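The same request can be scripted. The sketch below assumes the response has the usual nodes-info shape (a `nodes` object keyed by node ID, each entry holding `name` and `transport.publish_address`) and that the HTTP endpoint is reachable without authentication or TLS. The function names are hypothetical:

```python
import json
import urllib.request
from typing import Dict

# Path taken from the request shown above.
NODES_PATH = (
    "/_nodes/_local/transport"
    "?filter_path=nodes.*.name,nodes.*.transport.publish_address"
)


def publish_addresses(body: dict) -> Dict[str, str]:
    """Map node name to transport publish address from a parsed response."""
    return {
        node["name"]: node["transport"]["publish_address"]
        for node in body.get("nodes", {}).values()
    }


def fetch_publish_addresses(http_address: str) -> Dict[str, str]:
    """Query a running node, e.g. http_address='localhost:9200' (assumed)."""
    url = f"http://{http_address}{NODES_PATH}"
    with urllib.request.urlopen(url, timeout=5) as response:
        return publish_addresses(json.load(response))
```

Running `fetch_publish_addresses` against each master in turn gives the addresses that every other node must be able to reach on the transport port.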

DavidTurner · May 25, 2020, 7:08pm

Ah sorry just saw your second response. Nice work, well done. Configuring DNS is out of scope of what I can help you with, sorry.

system · June 22, 2020, 7:08pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
