Transport client frequent "node disconnected" messages

Hi,

We are on ES 5.5 and use the Transport Client with sniffing enabled (client.transport.sniff: true) to connect our services to our ES cluster via a load balancer. We only need the load balancer for discovering the nodes (they change dynamically); once we have established connections to the nodes themselves (which we keep open and don't close), we don't want to connect to the load balancer any more.

As far as I understand, addresses registered with the Transport Client via TransportClient#addTransportAddress(TransportAddress) are only used for the initial connection, after which the client sniffs the cluster and maintains connections to the data nodes it discovers.
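For reference, a minimal sketch of this kind of setup (the cluster name and load-balancer hostname below are placeholders, not our real values):

import java.net.InetAddress;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;
import org.elasticsearch.xpack.client.PreBuiltXPackTransportClient;

public class SniffingClientSketch {
    public static void main(String[] args) throws Exception {
        Settings settings = Settings.builder()
                .put("cluster.name", "my-cluster")       // placeholder
                .put("client.transport.sniff", true)     // discover the data nodes after the initial connection
                .build();

        // Only the load balancer is registered; sniffing should take over from there.
        TransportClient client = new PreBuiltXPackTransportClient(settings)
                .addTransportAddress(new InetSocketTransportAddress(
                        InetAddress.getByName("my-load-balancer.example.com"), 9300)); // placeholder

        // ... use the client ...
        client.close();
    }
}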

But we are seeing lots of these messages in our logs:

org.elasticsearch.transport.ConnectTransportException: [][192.168.192.9:9300] connect_timeout[30s]

The IP address in that message is the load balancer's.

Why would we be seeing these messages? Do we need to use the TransportClient#removeTransportAddress(TransportAddress) functionality?
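If removing it is the right approach, I imagine it would look something like this, continuing the sketch above (same placeholder load-balancer address; just an illustration, not something we have tried):

// Drop the load balancer from the client's seed list once sniffing has found the data nodes.
client.removeTransportAddress(new InetSocketTransportAddress(
        InetAddress.getByName("my-load-balancer.example.com"), 9300));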

The short answer is that this node is timing out trying to connect to your load balancer, which suggests there might be some problem with the load balancer. Are you seeing these messages even when the load balancer is healthy?

I think that the client will expect to be able to connect to all the addresses you give it via addTransportAddress on an ongoing basis. Although these addresses are mostly used for the initial connections, they may also be important if the client needs to reconnect to the cluster.

Thanks for your reply @DavidTurner. These messages are appearing consistently for us (almost every minute), so either the load balancer is never healthy or they are appearing even when it is healthy.

We are passing a HostFailureListener as the third argument to the constructor PreBuiltXPackTransportClient(Settings settings, Collection<Class<? extends Plugin>> plugins, HostFailureListener hostFailureListener) when instantiating our Transport Client.

We see this listener being called when a node disconnects after a connection has been established. I assume it would also be called if the initial connection cannot be established?
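Roughly, the wiring looks like this (the listener body is a placeholder rather than our real handler, and the empty plugin list is just for the sketch):

import java.util.Collections;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.plugins.Plugin;
import org.elasticsearch.xpack.client.PreBuiltXPackTransportClient;

public class FailureListenerSketch {
    public static void main(String[] args) throws Exception {
        Settings settings = Settings.builder()
                .put("cluster.name", "my-cluster")       // placeholder, as in the earlier sketch
                .put("client.transport.sniff", true)
                .build();

        // The listener is invoked when the client detects that a node has been disconnected.
        TransportClient.HostFailureListener failureListener = (node, ex) ->
                System.err.println("Disconnected from " + node + ": " + ex); // placeholder body

        TransportClient client = new PreBuiltXPackTransportClient(
                settings,
                Collections.<Class<? extends Plugin>>emptyList(),
                failureListener);

        // addTransportAddress and usage as in the earlier sketch ...
        client.close();
    }
}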

Ah, sorry, I misread the exception message. The connect_timeout[30s] message is always included, even if the connection attempt didn't time out.

Could you share the whole exception, including any stack traces and any inner exceptions (and their stack traces, and so on)?

Oh, interesting. That seems like it could be misleading.

Here's all I see for the stack trace:

org.elasticsearch.transport.ConnectTransportException: [][192.168.192.9:9300] connect_timeout[30s]
	at org.elasticsearch.transport.netty4.Netty4Transport.connectToChannels(Netty4Transport.java:361)
	at org.elasticsearch.transport.TcpTransport.openConnection(TcpTransport.java:548)
	at org.elasticsearch.transport.TcpTransport.openConnection(TcpTransport.java:116)
	at org.elasticsearch.transport.TransportService.openConnection(TransportService.java:351)
	at org.elasticsearch.client.transport.TransportClientNodesService$SniffNodesSampler$1.doRun(TransportClientNodesService.java:506)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: io.netty.channel.ConnectTimeoutException: connection timed out: internal-FGHW3GK20XPB-360902321.eu-central-1.elb.amazonaws.com/192.168.192.9:9300
	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:267)
	at io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38)
	at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:120)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:462)
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
	... 1 common frames omitted

Definitely a timeout :frowning:

I'd be tempted to grab the traffic with tcpdump and look for connections that aren't being opened properly. That would give us a definite answer about whether we should look harder at the load balancer or at the Elasticsearch side.
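For example, something along these lines on the client host, filtering on the load balancer address and transport port from the log above, should capture the relevant traffic:

tcpdump -i any -w es-transport.pcap host 192.168.192.9 and port 9300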
