Master-eligible node won't rejoin cluster after reboot

We have a multi-node cluster that includes 3 dedicated master nodes. I wanted to do a normal rolling package (apt) update/upgrade and reboot, but when I restarted the first master-eligible node, it wouldn't rejoin the cluster after the reboot.

ES version: 7.4.2
esm01 - elected master node
esm02 - master-eligible node
esm03 - master-eligible node - rebooted and won't join the existing cluster.

Config esm03:

cluster.name: cluster01
node.name: esm03
node.attr.rack: virtual
node.master: true
node.data: false

path.data: /es/data
path.logs: /es/logs

http.port: 9200
http.bind_host: X.X.1.42

transport.tcp.port: 9300
transport.bind_host: X.X.2.42
transport.publish_host: X.X.2.42

discovery.seed_hosts: ["esm01", "esm02", "esm03"]
gateway.recover_after_nodes: 5
action.destructive_requires_name: false

transport.tcp.connect_timeout: 120s

In the logs from esm03, all I see is this entry over and over again:

[2020-05-25T17:57:10,637][WARN ][o.e.c.c.ClusterFormationFailureHelper]
[esm03] master not discovered or elected yet, an election requires at least 2 
nodes with ids from [BXQ6ct83RDuDqQqJQ-3CIw, L2jT3WjSRqmjInBZm7xgyA, 
u5qnw0QZS3WrDFNMKcLQkQ], have discovered [{esm03}
{BXQ6ct83RDuDqQqJQ-3CIw}{fi8SPj2xSpOjkB-dJIHalQ}{X.X.2.42}{X.X.2.42:9300}
{ilm}{ml.machine_memory=8371269632, rack=virtual, xpack.installed=true, 
ml.max_open_jobs=20}] which is not a quorum; discovery will continue using 
[X.X.1.40:9300, X.X.1.41:9300, 127.0.1.1:9300] from hosts providers and [{esm03}
{BXQ6ct83RDuDqQqJQ-3CIw}{fi8SPj2xSpOjkB-dJIHalQ}{X.X.2.42}{X.X.2.42:9300}
{ilm}{ml.machine_memory=8371269632, rack=virtual, xpack.installed=true, 
ml.max_open_jobs=20}] from last-known cluster state; node term 9, last-
accepted version 463803 in term 9

The IDs correspond to the existing master nodes:

L2jT3WjSRqmjInBZm7xgyA - esm01
u5qnw0QZS3WrDFNMKcLQkQ - esm02
BXQ6ct83RDuDqQqJQ-3CIw - esm03

I can ping the other ES nodes, and UFW is open for connections on both ports 9200 and 9300. However, I do not see any attempts to join the cluster in the logs on the other master nodes, esm01 and esm02.

Where do I go from here? Could it be an external firewall blocking traffic?
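
Since ping only shows ICMP reachability, a direct TCP connection test against the transport port may help tell a firewall problem apart from an addressing problem. Below is a minimal Python sketch, assuming it is run on esm03. The helper name can_connect and the 3-second default timeout are illustrative choices, not anything provided by Elasticsearch:

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


# Example (hostnames from discovery.seed_hosts, transport port from the config):
#   for host in ["esm01", "esm02"]:
#       print(host, can_connect(host, 9300))
```

A False result only says the connection failed from this machine. It does not say whether a firewall, routing, or the resolved address is to blame.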

Sincerely,
Adrian

Looks like a connectivity issue indeed: this node is not even discovering the other masters.

Are you sure that these addresses are right? This node's transport is bound to X.X.2.42, not X.X.1.42, so the addresses it is trying (X.X.1.40 and X.X.1.41) may be on a different network.
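
The log above lists X.X.1.40, X.X.1.41 and 127.0.1.1 as the addresses obtained from the hosts providers, none of which are on the X.X.2.x network that the transport is bound to. One way to check this on each node is to resolve the seed hostnames and flag any address outside the transport network. This is only a rough sketch: the function names are made up for illustration, and the transport network has to be supplied by you in CIDR form (for example 10.0.2.0/24, a placeholder since the real prefix is masked in this thread):

```python
import ipaddress
import socket


def resolve_ipv4(hostname):
    """Return the sorted IPv4 addresses that hostname resolves to on this machine."""
    infos = socket.getaddrinfo(hostname, None, socket.AF_INET, socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def off_network_seeds(seed_hosts, transport_network, resolver=resolve_ipv4):
    """Map each seed host to its resolved addresses outside transport_network."""
    network = ipaddress.ip_network(transport_network)
    problems = {}
    for host in seed_hosts:
        outside = [
            addr for addr in resolver(host)
            if ipaddress.ip_address(addr) not in network
        ]
        if outside:
            problems[host] = outside
    return problems


# Example (placeholder network; substitute the real transport subnet):
#   print(off_network_seeds(["esm01", "esm02", "esm03"], "10.0.2.0/24"))
```

An empty result would suggest the hostnames resolve onto the transport network. Any entry in the result names a host whose address the other nodes probably cannot use for port 9300.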

I'm not sure the addresses are correct. We have separated the HTTP traffic on port 9200 and the management (transport) traffic on port 9300 onto two different subnets.

Should I try IP addresses instead of hostnames in discovery.seed_hosts? Or is there a better, more dynamic way of doing it (if the IP were to change, for instance)?

You were right. I tried using IP addresses instead of the hostnames, and it worked. Hm, now I have to figure out how to make the hostnames resolve to the correct interface on the server...

Thanks for the help!

I'd say to stick with hostnames and adjust your DNS system if they're not resolving as you need them to.

The important address is the transport publish address. Every node needs to be able to access every other node's transport publish address. You can obtain this from the nodes' logs at startup:

[2020-05-25T20:03:40,148][INFO ][o.e.t.TransportService   ] [node-0] publish_address {127.0.0.1:9300}, bound_addresses {127.0.0.1:9300}

Alternatively you can request it from a running node with GET /_nodes/_local/transport?filter_path=nodes.*.name,nodes.*.transport.publish_address (i.e. curl "http://$HTTP_ADDRESS/_nodes/_local/transport?filter_path=nodes.*.name,nodes.*.transport.publish_address").
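
If you want to collect this from several nodes, the same request can be scripted. Here is a small Python sketch using only the standard library. It assumes plain HTTP without authentication and a response of the form {"nodes": {"<node id>": {"name": ..., "transport": {"publish_address": ...}}}}. The function names are illustrative:

```python
import json
import urllib.request

FILTER = "nodes.*.name,nodes.*.transport.publish_address"


def publish_addresses(response):
    """Map node name to transport publish address from a parsed nodes-info response."""
    return {
        node["name"]: node["transport"]["publish_address"]
        for node in response.get("nodes", {}).values()
    }


def fetch_publish_addresses(http_address, timeout=5.0):
    """Query a running node; http_address is its HTTP host:port (no security assumed)."""
    url = f"http://{http_address}/_nodes/_local/transport?filter_path={FILTER}"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return publish_addresses(json.load(resp))


# Example: print(fetch_publish_addresses("<http host>:9200"))
```

Each address it reports is the one that every other node needs to be able to reach.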

Ah, sorry, I just saw your second response. Nice work, well done. Configuring DNS is outside the scope of what I can help you with, sorry.
