We have a 7.10.2 cluster with multiple data-only nodes and 3 master-only nodes.
We wiped one of the 3 masters clean (reinstall), and now it refuses to join the cluster:
master not discovered yet, this node has not previously joined a bootstrapped (v7+) cluster, and [cluster.initial_master_nodes] is empty on this node: have discovered [{node0008.example.com}{8ITgJeCmTCOqah6b_kxMDQ}{aGmuukVWTx6PzBlxPp0oaA}{10.10.239.8}{10.10.239.8:9300}{m}]; discovery will continue using [127.0.0.1:9301, 127.0.0.1:9302, 127.0.0.1:9303, 127.0.0.1:9304, 127.0.0.1:9305, [::1]:9300, [::1]:9301, [::1]:9302, [::1]:9303, [::1]:9304, [::1]:9305, [2010:660:5009:84:10:10:239:8]:9300, 10.10.239.53:9300, [2010:660:5009:84:10:10:239:53]:9300, 10.10.234.11:9300, [2010:660:5009:304:10:10:234:11]:9300, 10.10.235.58:9300, [2010:660:5009:304:10:10:235:58]:9300, 10.10.239.135:9300, [2010:660:5009:84:10:10:239:153]:9300, 10.10.234.4:9300, [2010:660:5009:304:10:10:234:4]:9300, 10.10.234.23:9300, [2010:660:5009:304:10:10:234:23]:9300, 10.10.234.115:9300, [2010:660:5009:304:10:10:234:115]:9300, 10.10.234.114:9300, [2010:660:5009:304:10:10:234:114]:9300, 10.10.234.129:9300, [2010:660:5009:304:10:10:234:129]:9300] from hosts providers and [{node0008.example.com}{8ITgJeCmTCOqah6b_kxMDQ}{aGmuukVWTx6PzBlxPp0oaA}{10.10.239.8}{10.10.239.8:9300}{m}] from last-known cluster state; node term 0, last-accepted version 0 in term 0
I tried with and without the cluster.initial_master_nodes setting (specifying all 3 master nodes).
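For reference, a minimal sketch of the kind of config I mean (hostnames other than node0008.example.com are placeholders, not our real names):

```yaml
# Sketch of the discovery-related settings on the reinstalled master-only node.
# Only node0008.example.com comes from the logs; the other hostnames are placeholders.
node.roles: [ master ]
discovery.seed_hosts:
  - node0008.example.com:9300
  - master-2.example.com:9300
  - master-3.example.com:9300
# Tried both with and without this block (listing all 3 master-eligible nodes):
cluster.initial_master_nodes:
  - node0008.example.com
  - master-2.example.com
  - master-3.example.com
```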
Looks like a discovery problem, but 7.10.2 is very old (long past EOL), and newer versions have much better support for troubleshooting this kind of thing, so I recommend upgrading ASAP.
I don't remember exactly how 7.10.2 behaves in this situation, but hopefully there's something in the logs to help. Also double-check your discovery config and inter-node connectivity.
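For example, a quick way to confirm that the transport port on each master-eligible node is reachable from the node that won't join (hostnames below are examples, and this assumes the default transport port 9300):

```sh
# Sketch: check TCP reachability of the transport port from the non-joining node.
# Hostnames are examples; 9300 is the default transport port.
for host in node0008.example.com master-2.example.com master-3.example.com; do
  nc -vz "$host" 9300
done
```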
The logs also show these SSL errors:
Exception during establishing a SSL connection: javax.net.ssl.SSLHandshakeException: Received fatal alert: handshake_failure
exception caught on transport layer [Netty4TcpChannel{localAddress=0.0.0.0/0.0.0.0:42374, remoteAddress=null}], closing connection
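In case it helps anyone else, the handshake can also be probed directly with openssl s_client against the transport port (host, port, and CA file path below are examples, not my actual paths):

```sh
# Sketch: attempt a TLS handshake against the transport port to reproduce the
# failure outside Elasticsearch. Host and CA path are examples.
openssl s_client -connect node0008.example.com:9300 \
  -CAfile /etc/elasticsearch/certs/ca.crt </dev/null
```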
I just tried adding another master node (a fourth one), and it joined the cluster with no issues. Is it possible the one that got reinstalled has some stale config left in the cluster?
Thanks for trying to help and for asking me to share the full logs. While they contained no clues beyond the ones I already posted, the request did prompt me to chase down the root cause of the SSL exceptions, which turned out to be the reason the node refused to join the cluster.
The root cause was a mismatched X.509 certificate and private key.
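For anyone else running into this: one way to confirm whether a certificate and key actually belong together is to compare the public key each contains; the two digests below should be identical (file paths are examples):

```sh
# Sketch: extract the public key from the certificate and from the private key
# and compare their digests. Paths are examples -- point at your transport cert/key.
openssl x509 -in /etc/elasticsearch/certs/transport.crt -noout -pubkey | openssl sha256
openssl pkey -in /etc/elasticsearch/certs/transport.key -pubout | openssl sha256
```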