Brand new master doesn't join cluster: bootstrap error

Hi,

We have a 7.10.2 cluster with multiple data-only nodes and 3 master-only nodes.
We wiped one of the 3 masters clean (reinstall), and now it refuses to join the cluster:

master not discovered yet, this node has not previously joined a bootstrapped (v7+) cluster, and [cluster.initial_master_nodes] is empty on this node: have discovered [{node0008.example.com}{8ITgJeCmTCOqah6b_kxMDQ}{aGmuukVWTx6PzBlxPp0oaA}{10.10.239.8}{10.10.239.8:9300}{m}]; discovery will continue using [127.0.0.1:9301, 127.0.0.1:9302, 127.0.0.1:9303, 127.0.0.1:9304, 127.0.0.1:9305, [::1]:9300, [::1]:9301, [::1]:9302, [::1]:9303, [::1]:9304, [::1]:9305, [2010:660:5009:84:10:10:239:8]:9300, 10.10.239.53:9300, [2010:660:5009:84:10:10:239:53]:9300, 10.10.234.11:9300, [2010:660:5009:304:10:10:234:11]:9300, 10.10.235.58:9300, [2010:660:5009:304:10:10:235:58]:9300, 10.10.239.135:9300, [2010:660:5009:84:10:10:239:153]:9300, 10.10.234.4:9300, [2010:660:5009:304:10:10:234:4]:9300, 10.10.234.23:9300, [2010:660:5009:304:10:10:234:23]:9300, 10.10.234.115:9300, [2010:660:5009:304:10:10:234:115]:9300, 10.10.234.114:9300, [2010:660:5009:304:10:10:234:114]:9300, 10.10.234.129:9300, [2010:660:5009:304:10:10:234:129]:9300] from hosts providers and [{node0008.example.com}{8ITgJeCmTCOqah6b_kxMDQ}{aGmuukVWTx6PzBlxPp0oaA}{10.10.239.8}{10.10.239.8:9300}{m}] from last-known cluster state; node term 0, last-accepted version 0 in term 0

I tried with and without the cluster.initial_master_nodes setting (specifying all 3 master nodes).

Looks like a discovery problem, but 7.10.2 is very old (long past EOL), and newer versions have much better support for troubleshooting this kind of thing, so I recommend you upgrade as soon as possible.

Yes, of course. But I'd rather not upgrade anything until I'm out of the woods.
What do you suggest doing on 7.10.2?

I don't remember exactly how 7.10.2 behaves in this situation, but hopefully there's something in the logs to help. Also double-check your discovery config and inter-node connectivity.
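As a rough way to check inter-node connectivity, something like the following Python sketch could be run on the node that fails to join. It assumes the seed hosts are listed one per line (optionally as host:port or [ipv6]:port) and that the transport port is the default 9300 when none is given. The function names parse_seed_hosts and can_connect are made up for this example and are not part of Elasticsearch.

```python
import socket
import sys

DEFAULT_TRANSPORT_PORT = 9300  # assumption: transport port left at its default


def parse_seed_hosts(text, default_port=DEFAULT_TRANSPORT_PORT):
    """Parse the contents of a seed hosts file into (host, port) pairs."""
    seeds = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("["):
            # Bracketed IPv6 address with an optional :port suffix.
            host, _, rest = line[1:].partition("]")
            port = int(rest[1:]) if rest.startswith(":") else default_port
        elif line.count(":") == 1:
            # hostname:port or ipv4:port
            host, _, port_text = line.partition(":")
            port = int(port_text)
        else:
            # Bare hostname or IPv4 address.
            host, port = line, default_port
        seeds.append((host, port))
    return seeds


def can_connect(host, port, timeout=3.0):
    """Return True if a plain TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Usage: python check_seeds.py /path/to/unicast_hosts.txt
    with open(sys.argv[1]) as handle:
        for host, port in parse_seed_hosts(handle.read()):
            status = "ok" if can_connect(host, port) else "FAILED"
            print(f"{host}:{port} {status}")
```

Note that this only tests plain TCP reachability. A successful connect does not prove that a TLS handshake on the transport port would succeed.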

I'm using discovery.seed_providers: file, and the file unicast_hosts.txt contains a list of all data and master nodes.

There is no other message in the node's logfile.

Sharing your logs and config would be helpful, as otherwise we are just guessing.

The only other log entries I get are:

Exception during establishing a SSL connection: javax.net.ssl.SSLHandshakeException: Received fatal alert: handshake_failure
exception caught on transport layer [Netty4TcpChannel{localAddress=0.0.0.0/0.0.0.0:42374, remoteAddress=null}], closing connection

Here's our config:

action:
  destructive_requires_name: true
bootstrap:
  memory_lock: true
cluster:
  remote:
    connect: false
network:
  host:
  - 10.10.239.8
  - 127.0.0.1
path:
  repo:
  - /var/lib/elasticsearch-backup
discovery.seed_providers: file
cluster.name: foo
node.name: node0008.example.com
path.logs: /var/log/elasticsearch
path.data: /var/lib/elasticsearch
node.data: False
node.master: True
node.ingest: False
cluster.initial_master_nodes:
- node0008.example.com
- node0053.example.com
- node0311.example.com
network.publish_host: 10.10.239.8

The unicast_hosts.txt file:

node0008.example.com
node0053.example.com
node0311.example.com
node0614.example.com
node0135.example.com
node0304.example.com
node0323.example.com
node0415.example.com
node0414.example.com
node0429.example.com

I just tried adding another master node (a fourth one), and that one joined the cluster with no issues. Is it possible that the reinstalled one has left some stale config in the cluster?

Please share your logs, not just an excerpt.

Thanks for trying to help and for asking me to share the full logs. Although they contained no clues beyond the ones I had already posted, the request prompted me to track down the root cause of the SSL exceptions, which turned out to be the reason the node refused to join the cluster.

The root cause was a mismatched X.509 certificate and private key.
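For anyone who lands here with the same symptom, a quick way to test whether a certificate and key belong together is sketched below in Python. It assumes both are PEM files and that the key is not password-protected. If the node uses a PKCS#12 or JKS keystore, the pair would have to be exported to PEM first. The function name cert_matches_key is made up for this example.

```python
import ssl


def cert_matches_key(cert_path: str, key_path: str) -> bool:
    """Return True if the PEM certificate and PEM private key load as a pair.

    Returns False when OpenSSL rejects the pair, which covers a key that
    does not match the certificate as well as files it cannot parse.
    A missing or unreadable file is not caught here and propagates as an
    OSError such as FileNotFoundError.
    """
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    try:
        # load_cert_chain raises ssl.SSLError if the private key does not
        # match the certificate.
        context.load_cert_chain(certfile=cert_path, keyfile=key_path)
    except ssl.SSLError:
        return False
    return True


if __name__ == "__main__":
    import sys

    # Usage: python check_pair.py node.crt node.key
    ok = cert_matches_key(sys.argv[1], sys.argv[2])
    print("certificate and key match" if ok else "certificate and key do NOT load as a pair")
```

A False result only says the pair could not be loaded together, so it is worth opening both files to rule out a wrong format before concluding that the key is the wrong one.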

Thanks again for your help.
