Master_not_discovered_exception after upgrade to Ubuntu Jammy and Docker 28.0.1

We are trying to upgrade our Elastic Stack deployment to run on Ubuntu Jammy (22.04) and Docker 28.0.1 on the hosts. (The Elasticsearch Docker image is based on Ubuntu 20.04.6; this is 7.17.12, and the same thing happens with 7.17.28.) Previously we were running Ubuntu Bionic (18.04) and Docker 18.06.3-ce. The elasticsearch.yml is the same as before:

---
cluster.name: elastic-docker-cluster
network.host: pmd43test-elastic-1.platform-lab.cloud.xxx.org
network.bind_host: 0.0.0.0
cluster.initial_master_nodes: ["pmd43test-elastic-1", "pmd43test-elastic-2", "pmd43test-elastic-3"]
discovery.seed_hosts: ["pmd43test-elastic-2.ops.platform-lab.intra", "pmd43test-elastic-3.ops.platform-lab.intra"]
node.data: true
node.master: true
node.name: "pmd43test-elastic-1"
node.ingest: true
node.ml: false

## X-Pack settings
xpack.license.self_generated.type: basic

This is the config for node 1; the others differ only in their respective node numbers.
Running a curl command on the node gives this result:

$ curl "localhost:9200/_cat/indices?v" 
{"error":{"root_cause":[{"type":"master_not_discovered_exception","reason":null}],"type":"master_not_discovered_exception","reason":null},"status":503}

In the elasticsearch docker logs I see this message:

{"type": "server", "timestamp": "2025-02-28T20:28:07,663Z", "level": "WARN", "co
mponent": "o.e.c.c.ClusterFormationFailureHelper", "cluster.name": "elastic-dock
er-cluster", "node.name": "pmd43test-elastic-1", "message": "master not discover
ed yet, this node has not previously joined a bootstrapped (v7+) cluster, and th
is node must discover master-eligible nodes [pmd43test-elastic-1, pmd43test-elas
tic-2, pmd43test-elastic-3] to bootstrap a cluster: have discovered [{pmd43test-
elastic-1}{ysP7nojvR9WERKGI-w5R-g}{hpo7cInWTjKjo_JyLFMXyw}{pmd43test-elastic-1.p
latform-lab.cloud.xxx.org}{127.0.1.1:9300}{cdfhimrstw}]; discovery will continue
 using [172.17.2.61:9300, 172.17.3.30:9300] from hosts providers and [{pmd43test
-elastic-1}{ysP7nojvR9WERKGI-w5R-g}{hpo7cInWTjKjo_JyLFMXyw}{pmd43test-elastic-1.
platform-lab.cloud.xxx.org}{127.0.1.1:9300}{cdfhimrstw}] from last-known cluster state; node term 0, last-accepted version 0 in term 0" }

and this:

{"type": "server", "timestamp": "2025-02-28T20:28:05,372Z", "level": "WARN", "component": "o.e.d.PeerFinder", "cluster.name": "elastic-docker-cluster", "node.name": "pmd43test-elastic-1", "message": "address [172.17.2.61:9300], node [null], requesting [false] connection failed: [pmd43test-elastic-2][127.0.1.1:9300] handshake failed. unexpected remote node {pmd43test-elastic-1}{ysP7nojvR9WERKGI-w5R-g}{hpo7cInWTjKjo_JyLFMXyw}{pmd43test-elastic-1.platform-lab.cloud.xxx.org}{127.0.1.1:9300}{cdfhimrstw}{xpack.installed=true, transform.node=true}" }
{"type": "server", "timestamp": "2025-02-28T20:28:02,370Z", "level": "WARN", "component": "o.e.d.HandshakingTransportAddressConnector", "cluster.name": "elastic-docker-cluster", "node.name": "pmd43test-elastic-1", "message": "[connectToRemoteMasterNode[172.17.2.61:9300]] completed handshake with [{pmd43test-elastic-2}{cV8rAE6rQgOod3BvAWrWDw}{ahXw6NLvTNy8AYdvm9ld-g}{pmd43test-elastic-2.platform-lab.cloud.xxx.org}{127.0.1.1:9300}{cdfhimrstw}{xpack.installed=true, transform.node=true}] but followup connection failed", 
"stacktrace": ["org.elasticsearch.transport.ConnectTransportException: [pmd43test-elastic-2][127.0.1.1:9300] handshake failed. unexpected remote node {pmd43test-elastic-1}{ysP7nojvR9WERKGI-w5R-g}{hpo7cInWTjKjo_JyLFMXyw}{pmd43test-elastic-1.platform-lab.cloud.xxx.org}{127.0.1.1:9300}{cdfhimrstw}{xpack.installed=true, transform.node=true}",
...

I've done a lot of searching and comparing between the 2 deployments and haven't had any luck. I hope someone here can tell me where to look.

Thanks.

Is this just a cut-obfuscate-and-paste difference, or are you actually using different hostnames here?

The entries in cluster.initial_master_nodes are node names, not hostnames; there's no requirement for them to be the same. The node names are visible in the previous message and are consistent with the ones in the config file:

"node.name": "pmd43test-elastic-1"

However, as per the docs you should remove this setting as soon as the cluster has formed, so you shouldn't still have been running with this elasticsearch.yml file:

After the cluster has formed, remove the cluster.initial_master_nodes setting from each node’s configuration and never set it again for this cluster. Do not configure this setting on nodes joining an existing cluster. Do not configure this setting on nodes which are restarting. Do not configure this setting when performing a full-cluster restart.
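
In other words, once the cluster has formed, each node's elasticsearch.yml keeps discovery.seed_hosts but drops the bootstrap setting. A minimal sketch based on your node 1 config:

cluster.name: elastic-docker-cluster
node.name: "pmd43test-elastic-1"
discovery.seed_hosts: ["pmd43test-elastic-2.ops.platform-lab.intra", "pmd43test-elastic-3.ops.platform-lab.intra"]
# cluster.initial_master_nodes deliberately left unset once the cluster exists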

On the other hand the logs indicate that you are not doing a proper upgrade:

this node has not previously joined a bootstrapped (v7+) cluster

Instead you're trying to start a brand-new cluster. Is that deliberate?

The final message, unexpected remote node, indicates that the nodes aren't reachable at unique addresses: pmd43test-elastic-2 advertised 127.0.1.1:9300 as its address, but connecting to 127.0.1.1:9300 reached pmd43test-elastic-1 instead. That's not going to work.

Thanks for your reply. Yes this is a deployment of a new cluster. We're running the same version of the elastic stack with an identical configuration on an older version of Ubuntu and using an older version of docker. When I deploy a new instance of that cluster it doesn't have this problem.

Thanks for your reply. The first is a list of node names, not hostnames. The .intra names are on an internal network; .org is an external one. The Elasticsearch instances communicate with one another over the internal network.

Just in case you missed it amongst the other important-but-not-quite-relevant stuff mentioned above, this is the fundamental issue: the unexpected remote node handshake failure, i.e. the nodes aren't reachable at unique addresses, because 127.0.1.1:9300 reaches a different node than the one that advertised it.

Thanks again. elastic-1 is where the log message originates. So you're saying it connected with itself instead of node 2 on a second attempt using that address? I'm not sure why that's happening, but I'll take a closer look.

Ah, sorry, this message is a little confusing in 7.x; 8.x has a better one (and there the log message even links to the relevant docs, which I recommend reading).

The message above indicates that pmd43test-elastic-1 connected to 172.17.2.61:9300 and found it was talking to pmd43test-elastic-2 there, and moreover pmd43test-elastic-2 shared that its publish address (at which it must be contactable by all other nodes) was 127.0.1.1:9300. But then when pmd43test-elastic-1 connected to 127.0.1.1:9300 it found it was talking to itself there.

From the rest of your config I think that means that network.host: pmd43test-elastic-1.platform-lab.cloud.xxx.org is resolving to 127.0.1.1 (perhaps in addition to other addresses). This is probably a mistake: you want to fix your resolver so that it only resolves to 172.17.2.61 or similar.
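
A quick way to see how that name actually resolves from the node's point of view is something like the following; the container name elasticsearch is an assumption, so adjust it to your deployment:

# Show the address(es) the configured hostname resolves to inside the container
docker exec elasticsearch getent hosts pmd43test-elastic-1.platform-lab.cloud.xxx.org
# If 127.0.1.1 shows up here, the resolver (or /etc/hosts) is what needs fixing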

IMO network.bind_host: 0.0.0.0 is also typically a mistake. Are you sure you want to listen for inter-node traffic on all addresses? If not, don't set network.bind_host so that it falls back to network.host. If you want to listen for HTTP traffic (but not inter-node traffic) on multiple addresses then use http.bind_host instead of network.bind_host.
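
Concretely, the sort of config I mean looks roughly like this; the internal hostname is an assumption based on your seed_hosts naming, so substitute whatever resolves to the node's 172.17.x.x address:

network.host: pmd43test-elastic-1.ops.platform-lab.intra    # must resolve to the internal 172.17.x.x address, not 127.0.1.1
# network.bind_host intentionally not set, so binding falls back to network.host
http.bind_host: 0.0.0.0    # only if you really do want HTTP (but not transport) reachable on all addresses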

Thanks very much! This is very helpful. I'll update with the solution when this problem is resolved.

I've tried removing the network.bind_host entry. There's still a problem, but with a difference: I no longer see the exceptions in the logs. Instead I see the following messages over and over. It does seem that there's a problem with how the pmd43test-elastic-1.platform-lab.cloud.xxx.org FQDN is getting resolved. I'm going to try substituting the IP addresses of the hosts on the respective networks in the elasticsearch.yml instead of their host names and see if that works. There's got to be something I'm doing in this deployment that's messing things up, since it works without the OS and Docker version changes I've made.

{"type": "server", "timestamp": "2025-03-03T19:15:52,057Z", "level": "WARN", "component": "o.e.c.c.ClusterFormationFailureHelper", "cluster.name": "elastic-docker-cluster", "node.name": "pmd43test-elastic-1", "message": "master not discovered yet, this node has not previously joined a bootstrapped (v7+) cluster, and this node must discover master-eligible nodes [pmd43test-elastic-1, pmd43test-elastic-2, pmd43test-elastic-3] to bootstrap a cluster: have discovered [{pmd43test-elastic-1}{5VmndbPjRFucYyvO6BATUg}{AvxbRMHpQPKLPMlE1ZBQKw}{pmd43test-elastic-1.platform-lab.cloud.xxx.org}{127.0.1.1:9300}{cdfhimrstw}]; discovery will continue using [172.17.0.96:9300, 172.17.2.190:9300] from hosts providers and [{pmd43test-elastic-1}{5VmndbPjRFucYyvO6BATUg}{AvxbRMHpQPKLPMlE1ZBQKw}{pmd43test-elastic-1.platform-lab.cloud.xxx.org}{127.0.1.1:9300}{cdfhimrstw}] from last-known cluster state; node term 0, last-accepted version 0 in term 0" }


{"type": "server", "timestamp": "2025-03-03T19:15:52,125Z", "level": "WARN", "component": "o.e.d.PeerFinder", "cluster.name": "elastic-docker-cluster", "node.name": "pmd43test-elastic-1", "message": "address [172.17.2.190:9300], node [null], requesting [false] connection failed: [][172.17.2.190:9300] connect_exception: Connection refused: pmd43test-elastic-3.ops.platform-lab.intra/172.17.2.190:9300: Connection refused" }

{"type": "server", "timestamp": "2025-03-03T19:15:52,126Z", "level": "WARN", "component": "o.e.d.PeerFinder", "cluster.name": "elastic-docker-cluster", "node.name": "pmd43test-elastic-1", "message": "address [172.17.0.96:9300], node [null], requesting [false] connection failed: [][172.17.0.96:9300] connect_exception: Connection refused: pmd43test-elastic-2.ops.platform-lab.intra/172.17.0.96:9300: Connection refused" }

Yes at least this is a little clearer:

pmd43test-elastic-2.ops.platform-lab.intra/172.17.0.96:9300: Connection refused

Look for log messages shortly after startup mentioning publish_address and bound_addresses - that should tell you how these hostnames are actually being resolved.
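
For example, something along these lines; the container name elasticsearch and the use of nc are assumptions, and any equivalent checks will do:

# See which transport/HTTP addresses this node actually bound and published
docker logs elasticsearch 2>&1 | grep -E 'publish_address|bound_addresses'
# Confirm the other nodes' transport ports are reachable from this host at all
nc -vz 172.17.0.96 9300
nc -vz 172.17.2.190 9300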

FWIW, I think I found the source of the problem: a difference in the /etc/hosts file between the old and new deployments. The old deployment had these entries, including one for the internal network:

172.17.0.92 pmd43test-elastic-1.ops.platform-lab.intra platform-lab-elastic-1
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4

The new one instead has these entries, using the external name:

127.0.1.1 pmd43test-elastic-1.platform-lab.cloud.xxx.org pmd43test-elastic-1
127.0.0.1 localhost

This is a change in the utility that my deployment uses for creating the VMs that these instances run on. Looks like I'm going to have to fix my deployment scripts to modify the /etc/hosts file.
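
Roughly what that fix amounts to on each node, as a sketch; the IP and names here are taken from the old deployment's entry above and will differ per node:

# Drop the managed line that maps the FQDN to the 127.0.1.1 loopback address
sudo sed -i '/^127\.0\.1\.1\s/d' /etc/hosts
# Re-add the internal-network entry the old deployment had
echo '172.17.0.92 pmd43test-elastic-1.ops.platform-lab.intra platform-lab-elastic-1' | sudo tee -a /etc/hosts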

Thanks very much for your help.


That looks like a good explanation of what you were seeing, indeed.


We had a very similar resolution in another thread recently: essentially the same problem, where the name <-> address resolution was giving a 127.x.x.x "remote" address. In my experience, that's almost always a mistake.


I modified the /etc/hosts files on all three nodes to replace the external network entry with the internal one that we would have had before, rebooted the nodes, and they came up fine. So that looks like the solution. I'll have to talk to our infrastructure guys about this, because mapping an external FQDN to an internal IP doesn't make sense to me, but I'm not an expert in this area. Thanks again for your help!


What's interesting here is that the 127.0.1.1 loopback address seems to be the default for Ubuntu distributions now when the /etc/hosts file is "managed". We deploy our VMs in OpenStack and I recently found this warning there:

Warning

Some distributions add an extraneous entry in the /etc/hosts file that resolves the actual hostname to another loopback IP address such as 127.0.1.1. You must comment out or remove this entry to prevent name resolution problems. Do not remove the 127.0.0.1 entry.

So this could be a more common problem.


This is probably not the best place for this discussion, though it (the 127.0.1.1 /etc/hosts entry) is an interesting discussion point. It's certainly not new; Ubuntu has done this for a while (I installed 18.04 yesterday and noticed it there). And, e.g., Red Hat does not. Such is Linux-land.

Ubuntu is obviously derived from Debian, and this is the documentation of that Debian policy choice, maybe also a sort of convention?

(Chapter 5. Network setup)

I guess Ubuntu could diverge, but presumably they have not yet done so.

There's a longish thread in the Debian mailing list archive from years ago, if you are interested; the crux seems to be the wish that the hostname (what's in /etc/hostname) should be resolvable, in the gethostbyname sense, independently of current network connectivity, but also resolve to something distinct from 127.0.0.1. 127.0.1.1 was obviously chosen fairly arbitrarily; they could equally have chosen 127.42.42.42 if they had a sense of humor.

You had:

127.0.1.1 pmd43test-elastic-1.platform-lab.cloud.xxx.org pmd43test-elastic-1

which is IMHO wrong, but there are others who disagree, and it's even written in the hostname man page in Ubuntu (at least 18.04+) that the FQDN entry can be mapped to 127.0.1.1.

But IMHO your main difficulty was that your setup was not consistent across the 3 nodes, admittedly in a non-obvious way.


Thanks again. There may be some justification for mapping the FQDN to 127.0.1.1 in /etc/hosts; it was consistently done on all three nodes. It used to be that our infrastructure group set up the /etc/hosts file with the IP and domain of an internal network. At some point they stopped doing that, and /etc/hosts is now managed to this default. I'll have to manage it differently in my deployments from now on, because it appears that Elasticsearch depends on it for the handshaking involved in forming a cluster.