TCP Transport versus Transport Client settings


(Ivan Brusic) #1

Trying to solve intermittent "NoNodeAvailableException: No node
available" errors that occur while searching. Cluster consists of 4
nodes running 0.19.2 using multicast. Client is a singleton
TransportClient configured with all the nodes defined in the settings.
client.transport.sniff is set to true. All other settings for either
client or server are the default. Queries have a timeout of 500ms.

I am increasing the timeout limits in the hope of eliminating the
problem. First question: is it possible to determine which node was
being accessed when the NoNodeAvailableException was returned?
Perhaps only one node has an issue. For the transport client,
client.transport.ping_timeout would be the setting to change, but the
current default of 5 seconds already seems high. Is the transport
client communication solely to blame, or could inter-node (TCP
Transport) communication be at fault as well? Should those settings be
modified too?
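
For reference, the client configuration described above can be sketched as settings keys (a sketch, not a verified configuration; the nodes_sampler_interval line is my assumption and is not mentioned in the thread, but it governs how often the listed nodes are re-pinged):

```yaml
# Client-side TransportClient settings discussed above.
client.transport.sniff: true                 # sniff remaining cluster nodes from the seed nodes
client.transport.ping_timeout: 5s            # default; how long to wait for a ping response
client.transport.nodes_sampler_interval: 5s  # default (assumption); how often the client re-pings nodes
```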

Cheers,

Ivan


(Shay Banon) #2

If you set client.transport (or org.elasticsearch.client.transport if embedded) logging to debug, do you see disconnections? Also, can you try a newer 0.19 version? The transport client logic has been improved in later releases.
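
Shay's suggestion translates to a logging.yml entry along these lines (a sketch; the exact key may vary by version, since logging.yml keys usually omit the org.elasticsearch prefix):

```yaml
# config/logging.yml on the client side
logger:
  client.transport: DEBUG
```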

On Aug 10, 2012, at 8:38 PM, Ivan Brusic ivan@brusic.com wrote:



(Ivan Brusic) #3

Thanks Shay.

We are using Lucene 3.5 for other parts of the code on the client
side, so we are not quite ready to move to Lucene 3.6 (it needs
testing). The issue is not consistent, so it has been difficult to
reproduce the problem faithfully. I will stress test with debug on.
Should both the client and server have debug enabled? I am assuming
the disconnect is on the client side.

Ivan

On Mon, Aug 13, 2012 at 3:18 AM, Shay Banon kimchy@gmail.com wrote:



(Shay Banon) #4

Just the client side needs logging. Also, you can safely run the transport client with Lucene 3.5 and not use Lucene 3.6.

On Aug 13, 2012, at 7:01 PM, Ivan Brusic ivan@brusic.com wrote:



(Jörg Prante) #5

Hi Ivan,

my recommendation is also to upgrade from 0.19.2 to a newer version,
because there were issues with TransportClient sniffing. For example,
see issue #1819.

Best regards,

Jörg

On Friday, August 10, 2012 8:38:46 PM UTC+2, Ivan Brusic wrote:



(Ivan Brusic) #6

Hopefully I would never encounter #1819 since running with no nodes is
not where I want to be!

Ran some stress tests for a few hours without triggering the issue.
Then, while running some other queries, I managed to get into the
erroneous state:

https://gist.github.com/88222848a398f813fdb0

The client timed out although it was executing queries at the time. I
converted the code to Lucene 3.6 (only one API change) and will test
later with 0.19.8.

Cheers,

Ivan

On Mon, Aug 13, 2012 at 4:20 PM, Jörg Prante joergprante@gmail.com wrote:



(Ivan Brusic) #7

Not only have I been able to replicate the issue, but the problem is
now consistent. We are now getting "RemoteTransportException ...
OutOfMemoryError" errors as well.

Here are the recent errors: https://gist.github.com/c6728e50b40a34a9c42a

The only relevant commit to the transport client that I can see is

I will upgrade to 0.19.8 today. I also just noticed another commit
(bdea0e2eddb4373b850e00d8e363c5240d78d180) that I hope gets released
soon as well (I wrote identical code, but prefer to use a standard
class whenever possible).

Cheers,

Ivan

On Mon, Aug 13, 2012 at 5:42 PM, Ivan Brusic ivan@brusic.com wrote:


