Transport client frequent "node disconnected" messages

Hi,

We are on ES 5.5 and use the Transport Client with sniffing enabled (client.transport.sniff: true) to connect our services to our ES cluster via a load balancer. We only need the load balancer for discovering the nodes (they change dynamically); once we have established connections to the nodes themselves (which we keep open and don't close), we don't want to connect to the load balancer any more.

As far as I understand, addresses registered with the Transport Client via TransportClient#addTransportAddress(TransportAddress) are only used for the initial connection, after which the client sniffs the cluster and maintains connections to the data nodes it discovers.
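For reference, a minimal sketch of this kind of setup (the cluster name and load-balancer hostname below are placeholders, not our real values):

import java.net.InetAddress;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;
import org.elasticsearch.xpack.client.PreBuiltXPackTransportClient;

public class SniffingClientSketch {
    public static void main(String[] args) throws Exception {
        Settings settings = Settings.builder()
                .put("cluster.name", "my-cluster")       // placeholder
                .put("client.transport.sniff", true)     // discover the data nodes after the initial connection
                .build();

        // Only the load balancer is registered; sniffing should take over from there.
        TransportClient client = new PreBuiltXPackTransportClient(settings)
                .addTransportAddress(new InetSocketTransportAddress(
                        InetAddress.getByName("my-load-balancer.example.com"), 9300)); // placeholder

        // ... use the client ...
        client.close();
    }
}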

But we are seeing lots of these messages in our logs:

org.elasticsearch.transport.ConnectTransportException: [][192.168.192.9:9300] connect_timeout[30s]

The IP address in that message is the load balancer's.

Why would we be seeing these messages? Do we need to use the TransportClient#removeTransportAddress(TransportAddress) functionality?
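If removing it is the right approach, I imagine it would look something like this, continuing the sketch above (same placeholder load-balancer address; just an illustration, not something we have tried):

// Drop the load balancer from the client's seed list once sniffing has found the data nodes.
client.removeTransportAddress(new InetSocketTransportAddress(
        InetAddress.getByName("my-load-balancer.example.com"), 9300));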

The short answer is that this node is timing out trying to connect to your load balancer, which suggests there might be some problem with the load balancer. Are you seeing these messages even when the load balancer is healthy?

I think that the client will expect to be able to connect to all the addresses you give it via addTransportAddress on an ongoing basis. Although these addresses are mostly used for the initial connections, they may also be important if the client needs to reconnect to the cluster.

Thanks for your reply @DavidTurner. These messages are appearing consistently for us (almost every minute), so either the load balancer is never healthy or they are appearing even when it is healthy.

We are passing a HostFailureListener as the third argument to the constructor PreBuiltXPackTransportClient(Settings settings, Collection<Class<? extends Plugin>> plugins, HostFailureListener hostFailureListener) when instantiating our Transport Client.

We see this listener being called when a node disconnects after a connection has been established. I assume it would also be called if the initial connection cannot be established?
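Roughly, the wiring looks like this (the listener body is a placeholder rather than our real handler, and the empty plugin list is just for the sketch):

import java.util.Collections;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.plugins.Plugin;
import org.elasticsearch.xpack.client.PreBuiltXPackTransportClient;

public class FailureListenerSketch {
    public static void main(String[] args) throws Exception {
        Settings settings = Settings.builder()
                .put("cluster.name", "my-cluster")       // placeholder, as in the earlier sketch
                .put("client.transport.sniff", true)
                .build();

        // The listener is invoked when the client detects that a node has been disconnected.
        TransportClient.HostFailureListener failureListener = (node, ex) ->
                System.err.println("Disconnected from " + node + ": " + ex); // placeholder body

        TransportClient client = new PreBuiltXPackTransportClient(
                settings,
                Collections.<Class<? extends Plugin>>emptyList(),
                failureListener);

        // addTransportAddress and usage as in the earlier sketch ...
        client.close();
    }
}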

Ah, sorry, I misread the exception message. The connect_timeout[30s] message is always included, even if the connection attempt didn't time out.

Could you share the whole exception, including any stack traces and any inner exceptions (and their stack traces, and so on)?

Oh, interesting. That seems like it could be misleading.

Here's all I see for the stack trace:

org.elasticsearch.transport.ConnectTransportException: [][192.168.192.9:9300] connect_timeout[30s]
	at org.elasticsearch.transport.netty4.Netty4Transport.connectToChannels(Netty4Transport.java:361)
	at org.elasticsearch.transport.TcpTransport.openConnection(TcpTransport.java:548)
	at org.elasticsearch.transport.TcpTransport.openConnection(TcpTransport.java:116)
	at org.elasticsearch.transport.TransportService.openConnection(TransportService.java:351)
	at org.elasticsearch.client.transport.TransportClientNodesService$SniffNodesSampler$1.doRun(TransportClientNodesService.java:506)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: io.netty.channel.ConnectTimeoutException: connection timed out: internal-FGHW3GK20XPB-360902321.eu-central-1.elb.amazonaws.com/192.168.192.9:9300
	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:267)
	at io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38)
	at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:120)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:462)
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
	... 1 common frames omitted

Definitely a timeout :frowning:

I'd be tempted to grab the traffic with tcpdump and look for connections that aren't being opened properly. That would give us a definite answer about whether we should look harder at the load balancer or at the Elasticsearch side.
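For example, something along these lines on the client host, filtering on the load balancer address and transport port from the log above, should capture the relevant traffic:

tcpdump -i any -w es-transport.pcap host 192.168.192.9 and port 9300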
