Connection timeout to master for new nodes

I have a cluster on version 2.4.1, and some nodes cannot connect to the master.
The error log is:

[WARN ][discovery.zen            ] [...] failed to connect to master [...] retrying...
org.elasticsearch.transport.ConnectTransportException: [master][x.x.x.x:9310] connect_timeout[30s]
        at org.elasticsearch.transport.netty.NettyTransport.connectToChannels(NettyTransport.java:1002) ~[elasticsearch-2.4.1.jar:2.4.1]
        at org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:937) ~[elasticsearch-2.4.1.jar:2.4.1]
        at org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:911) ~[elasticsearch-2.4.1.jar:2.4.1]
        at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:260) ~[elasticsearch-2.4.1.jar:2.4.1]
        at org.elasticsearch.discovery.zen.ZenDiscovery.joinElectedMaster(ZenDiscovery.java:444) [elasticsearch-2.4.1.jar:2.4.1]
        at org.elasticsearch.discovery.zen.ZenDiscovery.innerJoinCluster(ZenDiscovery.java:396) [elasticsearch-2.4.1.jar:2.4.1]
        at org.elasticsearch.discovery.zen.ZenDiscovery.access$4400(ZenDiscovery.java:96) [elasticsearch-2.4.1.jar:2.4.1]
        at org.elasticsearch.discovery.zen.ZenDiscovery$JoinThreadControl$1.run(ZenDiscovery.java:1296) [elasticsearch-2.4.1.jar:2.4.1]

The master is under low load, netstat shows that the node has several established connections to the master, I can establish a TCP connection with telnet without any issue, and the master immediately closes the connection if I send junk over it.
Many nodes are connected to the cluster with no issues; this only happens for some new nodes I try to add.

What could cause this issue? How can I investigate what is happening?

This looks like a connectivity problem outside of Elasticsearch. When you try using telnet from the affected node to the master, are you sure you're using the same address and port as Elasticsearch is using? Is there anything unusual about your network? Can you make sure you're using IP addresses rather than hostnames everywhere, to rule out a DNS issue?
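For example, from the affected node you could try something like this (port 9310 comes from the exception in your log, and the config path assumes a standard package install, so adjust both to your setup):

    # Check the exact transport address/port shown in the exception
    telnet x.x.x.x 9310

    # Confirm the node's discovery settings use IP addresses, not hostnames
    grep -E 'network.host|transport.tcp.port|discovery.zen.ping.unicast.hosts' /etc/elasticsearch/elasticsearch.yml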

I used the IP/port found in the log with telnet, so it's the same. I already use IP addresses everywhere, no DNS.
The only "unusual" thing is that the nodes run inside Docker, but I do the netstat/telnet checks inside the containers.

Puzzling. I think at this point I'd break out the heavy machinery of tcpdump, capture a successful connection via telnet and an unsuccessful one by Elasticsearch and see if I could spot any differences.
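Something along these lines on the affected node should do it (the interface name and capture file are placeholders, and 9310 is the port from your log):

    # Capture everything between this node and the master's transport port
    tcpdump -i eth0 -w es-join.pcap host x.x.x.x and port 9310

    # Take one capture while telnet succeeds and another while Elasticsearch
    # retries the join, then compare them, e.g. with:
    tcpdump -nn -r es-join.pcap | head -50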

I was hoping not to have to resort to tcpdump, but I will try.

I tried two tcpdump captures between two nodes and the master, and this is what I observe:

  • they both connect to the master, send an internal:discovery/zen/unicast packet, receive a response, send another internal:discovery/zen/unicast packet and get another response, then close the connection
  • they both open 13 connections to the master, but there is a difference here: the node that successfully connects to the master sends data on these connections, whereas the node that fails closes all of these connections after 40-45s without sending anything on them

I don't know how the protocol works, so I don't know how to interpret this, any ideas?

I have more than 40 nodes in discovery.zen.ping.unicast.hosts; could that have an impact on this?

I'm more familiar with today's code than with 2.4.1's (released over two years ago), although the basic flow looks similar to today's. The connections seem to be started but not fully established. It's a little strange that the initial "probe" connection succeeds but the subsequent ones fail. Can you share the tcpdump output?

I'd normally only expect this list to contain the addresses of the master-eligible nodes, but I don't currently see how having more nodes in this list would cause what you're seeing.
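For reference, with (say) three dedicated master-eligible nodes I'd expect something closer to this in elasticsearch.yml (the addresses and the 9310 port are placeholders for your actual masters):

    discovery.zen.ping.unicast.hosts: ["10.0.0.1:9310", "10.0.0.2:9310", "10.0.0.3:9310"]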

After removing some nodes that were down from this list, the node connected to the master successfully, but I don't understand why it fails when a node in the list is down.

The output is there.

Well, this seems like a solution, although I don't understand it either. I don't have a development environment suitable for looking at 2.4.1, so I can only speculate.

Thanks. As far as I can see, the connections are correctly established, which wasn't what I expected. Perhaps we consume all the available threads trying to connect to unavailable nodes and end up not being able to process the reachable ones in time?
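If that's what's going on then, purely as speculation, raising the number of concurrent connection attempts the unicast pinger makes might help, so that dead entries in the list tie up fewer slots. From memory the setting in 2.x is the one below, but please check the 2.4 reference docs before relying on it:

    # Default is 10; with 40+ entries in unicast.hosts, dead entries can
    # occupy most of the slots for the full connect timeout.
    discovery.zen.ping.unicast.concurrent_connects: 40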
