Connection timeout to master for new nodes

I have a cluster on version 2.4.1, and some nodes cannot connect to the master.
The error log is:

[WARN ][discovery.zen            ] [...] failed to connect to master [...] retrying...
org.elasticsearch.transport.ConnectTransportException: [master][x.x.x.x:9310] connect_timeout[30s]
        at org.elasticsearch.transport.netty.NettyTransport.connectToChannels(NettyTransport.java:1002) ~[elasticsearch-2.4.1.jar:2.4.1]
        at org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:937) ~[elasticsearch-2.4.1.jar:2.4.1]
        at org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:911) ~[elasticsearch-2.4.1.jar:2.4.1]
        at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:260) ~[elasticsearch-2.4.1.jar:2.4.1]
        at org.elasticsearch.discovery.zen.ZenDiscovery.joinElectedMaster(ZenDiscovery.java:444) [elasticsearch-2.4.1.jar:2.4.1]
        at org.elasticsearch.discovery.zen.ZenDiscovery.innerJoinCluster(ZenDiscovery.java:396) [elasticsearch-2.4.1.jar:2.4.1]
        at org.elasticsearch.discovery.zen.ZenDiscovery.access$4400(ZenDiscovery.java:96) [elasticsearch-2.4.1.jar:2.4.1]
        at org.elasticsearch.discovery.zen.ZenDiscovery$JoinThreadControl$1.run(ZenDiscovery.java:1296) [elasticsearch-2.4.1.jar:2.4.1]

The master is under low load, netstat shows that the node has several established connections to the master, I can establish a TCP connection with telnet without any issue, and the master immediately closes the connection if I send junk over it.
Many nodes are connected to the cluster with no issues; this only happens for some new nodes I try to add.

What could cause this issue? How can I investigate what is happening?

This looks like a connectivity problem outside of Elasticsearch. When you try using telnet from the affected node to the master, are you sure you're using the same address and port as Elasticsearch is using? Is there anything unusual about your network? Can you make sure you're using IP addresses rather than hostnames everywhere, to rule out a DNS issue?
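For example, from the affected node you could try something like this (port 9310 comes from the exception in your log, and the config path assumes a standard package install, so adjust both to your setup):

    # Check the exact transport address/port shown in the exception
    telnet x.x.x.x 9310

    # Confirm the node's discovery settings use IP addresses, not hostnames
    grep -E 'network.host|transport.tcp.port|discovery.zen.ping.unicast.hosts' /etc/elasticsearch/elasticsearch.yml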

I used the IP/port found in the log with telnet, so it's the same. I already use IP addresses everywhere, no DNS.
The only "unusual" thing is that the nodes run inside Docker, but I do the netstat/telnet checks inside the containers.

Puzzling. I think at this point I'd break out the heavy machinery of tcpdump, capture a successful connection via telnet and an unsuccessful one by Elasticsearch and see if I could spot any differences.
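Something along these lines on the affected node should do it (the interface name and capture file are placeholders, and 9310 is the port from your log):

    # Capture everything between this node and the master's transport port
    tcpdump -i eth0 -w es-join.pcap host x.x.x.x and port 9310

    # Take one capture while telnet succeeds and another while Elasticsearch
    # retries the join, then compare them, e.g. with:
    tcpdump -nn -r es-join.pcap | head -50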

I was hoping not to have to resort to tcpdump, but I will try.

I tried two tcpdump captures between two nodes and the master, and this is what I observe:

  • they both connect to the master, send an internal:discovery/zen/unicast packet, receive a response, send another internal:discovery/zen/unicast packet and get another response, then close the connection
  • they both open 13 connections to the master, but there is a difference here: the node that successfully connects to the master sends data on these connections, whereas the node that fails closes all of these connections after 40-45s without sending anything on them

I don't know how the protocol works, so I don't know how to interpret this, any ideas?

I have more than 40 nodes in discovery.zen.ping.unicast.hosts; could that have an impact on this?

I'm more familiar with today's code than with 2.4.1's (released over two years ago), although the basic flow looks similar to today's. The connections seem to be started but not fully established. It's a little strange that the initial "probe" connection succeeds but the subsequent ones fail. Can you share the tcpdump output?

I'd normally only expect this list to contain the addresses of the master-eligible nodes, but I don't currently see how having more nodes in this list would cause what you're seeing.
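For reference, with (say) three dedicated master-eligible nodes I'd expect something closer to this in elasticsearch.yml (the addresses and the 9310 port are placeholders for your actual masters):

    discovery.zen.ping.unicast.hosts: ["10.0.0.1:9310", "10.0.0.2:9310", "10.0.0.3:9310"]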

After removing some nodes that were down from this list, the node connected to the master successfully, but I don't understand why it fails when a node in the list is down.

The output is there.

Well, this seems like a solution, although I don't understand it either. I don't have a development environment suitable for looking at 2.4.1, so I can only speculate.

Thanks. As far as I can see, the connections are correctly established, which wasn't what I expected. Perhaps we consume all the available threads trying to connect to unavailable nodes and end up not being able to process the reachable ones in time?
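If that's what's going on then, purely as speculation, raising the number of concurrent connection attempts the unicast pinger makes might help, so that dead entries in the list tie up fewer slots. From memory the setting in 2.x is the one below, but please check the 2.4 reference docs before relying on it:

    # Default is 10; with 40+ entries in unicast.hosts, dead entries can
    # occupy most of the slots for the full connect timeout.
    discovery.zen.ping.unicast.concurrent_connects: 40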
