I am trying to add a couple of nodes to an existing cluster of 3 nodes. These "new" nodes have been in the cluster in the past but were taken out for "refurbishment" and now I am having issues getting them back in.
I am seeing this in the logs of the machine that is trying to join:
[2021-06-14T10:08:11,368][WARN ][o.e.c.c.JoinHelper ] [secesprd05] last failed join attempt was 7.7s ago, failed to join {secesprd01}{kAWPcpoxSNSN9WlUsYlQlg}{pKGIqAxXRTy4NHxIq2HgwA}{10.6.0.67}{10.6.0.67:9300}{cdhmw}{xpack.installed=true, molochtype=hot, transform.node=false} with JoinRequest{sourceNode={secesprd05}{4cPiEfloRoKgvx-NqVp4aA}{fnNOdzfIT66oVhACJLUptg}{130.216.236.212}{130.216.236.212:9300}{c}{xpack.installed=true, molochtype=none, transform.node=false}, minimumTerm=21, optionalJoin=Optional.empty}
org.elasticsearch.transport.RemoteTransportException: [secesprd01][10.6.0.67:9300][internal:cluster/coordination/join]
Caused by: org.elasticsearch.transport.ConnectTransportException: [secesprd05]. [130.216.236.212:9300] connect_timeout[30s]
at org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onTimeout(TcpTransport.java:984) ~[elasticsearch-7.10.1.jar:7.10.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:678) ~[elasticsearch-7.10.1.jar:7.10.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
at java.lang.Thread.run(Thread.java:832) [?:?]
[2021-06-14T10:08:11,369][WARN ][o.e.c.c.ClusterFormationFailureHelper] [secesprd05] master not discovered yet: have discovered [{secesprd05}{4cPiEfloRoKgvx-NqVp4aA}{fnNOdzfIT66oVhACJLUptg}{130.216.236.212}{130.216.236.212:9300}{c}{xpack.installed=true, molochtype=none, transform.node=false}, {secesprd01}{kAWPcpoxSNSN9WlUsYlQlg}{pKGIqAxXRTy4NHxIq2HgwA}{10.6.0.67}{10.6.0.67:9300}{cdhmw}{xpack.installed=true, molochtype=hot, transform.node=false}, {secesprd02}{6UDagJW2T3eWM-0PQJ0rMA}{HLQJOMv1SpOCPfcJZqe2dg}{10.6.0.68}{10.6.0.68:9300}{cdhmw}{xpack.installed=true, molochtype=hot, transform.node=false}, {secmonprd07}{TNHldGyAQ52sNlIbGPbgMg}{QQ3Iau6fQaKF4-eqT6oGDQ}{130.216.5.111}{130.216.5.111:9300}{dmw}{xpack.installed=true, molochtype=warm, transform.node=false}]; discovery will continue using [10.6.0.67:9300, 10.6.0.68:9300, 130.216.5.111:9300] from hosts providers and [] from last-known cluster state; node term 21, last-accepted version 181555 in term 6
In other words, it is complaining about a connect timeout (connect_timeout[30s]) on 130.216.236.212:9300.
When I run tcpdump I can see that the nodes are communicating with each other on port 9300, and there are no obvious errors or timeouts. (Pcaps available on request.)
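
One way to cross-check the tcpdump observation is to attempt a plain TCP connect on port 9300 in each direction. The stack trace above appears to show secesprd01 timing out while connecting back to secesprd05 at 130.216.236.212:9300, so that direction is the interesting one. Below is a minimal sketch in Python, assuming it can be run on each node in turn. The names `can_connect`, `report` and `TARGETS` and the 5-second timeout are illustrative choices, not part of Elasticsearch; the addresses are copied from the log lines above.

```python
import socket

# Transport addresses copied from the log output above.
TARGETS = [
    ("10.6.0.67", 9300),        # secesprd01
    ("10.6.0.68", 9300),        # secesprd02
    ("130.216.5.111", 9300),    # secmonprd07
    ("130.216.236.212", 9300),  # secesprd05 (the node trying to join)
]


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port completes within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable network and timeouts.
        return False


def report(targets, timeout=5.0):
    """Return one 'host:port reachable=True/False' line per target."""
    return [
        f"{host}:{port} reachable={can_connect(host, port, timeout)}"
        for host, port in targets
    ]
```

Running `print("\n".join(report(TARGETS)))` on secesprd01 and again on secesprd05 would show whether new connections can be opened in both directions. This only tests the TCP handshake, so a successful result does not rule out problems at the Elasticsearch transport or TLS layer.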
There are no entries in the logs of the other nodes that indicate anything amiss (actually there are no logs for the time period at all).
I have tried restarting one of the existing nodes but that made no difference.
Both the new nodes show identical symptoms.
I am at a loss as to what to check next.