TCP Transport versus Transport Client settings


(Ivan Brusic) #1

Trying to solve intermittent "NoNodeAvailableException: No node
available" errors that occur while searching. Cluster consists of 4
nodes running 0.19.2 using multicast. Client is a singleton
TransportClient configured with all the nodes defined in the settings.
client.transport.sniff is set to true. All other settings for either
client or server are the default. Queries have a timeout of 500ms.

I am increasing the timeout limits in the hope of eliminating the
problem. First question: is it possible to determine which node was
being accessed when the NoNodeAvailableException was returned?
Perhaps only one node has an issue. For the transport client,
client.transport.ping_timeout would be the setting to change, but the
current default of 5 seconds already seems high. Is the transport
client communication solely to blame, or could inter-node (TCP
Transport) communication be at fault as well? Should those settings be
modified too?
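
For reference, the client configuration described above can be sketched as settings keys (a sketch, not a verified configuration; the nodes_sampler_interval line is my assumption and is not mentioned in the thread, but it governs how often the listed nodes are re-pinged):

```yaml
# Client-side TransportClient settings discussed above.
client.transport.sniff: true                 # sniff remaining cluster nodes from the seed nodes
client.transport.ping_timeout: 5s            # default; how long to wait for a ping response
client.transport.nodes_sampler_interval: 5s  # default (assumption); how often the client re-pings nodes
```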

Cheers,

Ivan


(Shay Banon) #2

If you set client.transport (or org.elasticsearch.client.transport if embedded) logging to debug, do you see disconnections? Also, can you try a newer 0.19 version? The transport client logic has been improved in later releases.
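
Shay's suggestion translates to a logging.yml entry along these lines (a sketch; the exact key may vary by version, since logging.yml keys usually omit the org.elasticsearch prefix):

```yaml
# config/logging.yml on the client side
logger:
  client.transport: DEBUG
```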

On Aug 10, 2012, at 8:38 PM, Ivan Brusic ivan@brusic.com wrote:



(Ivan Brusic) #3

Thanks Shay.

We are using Lucene 3.5 for other parts of the code on the client
side, so we are not quite ready to move to Lucene 3.6 (it needs
testing). The issue is not consistent, so it has been difficult to
reproduce the problem faithfully. I will stress test with debug on.
Should both the client and server have debug enabled? I am assuming
the disconnect is on the client side.

Ivan

On Mon, Aug 13, 2012 at 3:18 AM, Shay Banon kimchy@gmail.com wrote:



(Shay Banon) #4

Just the client side needs logging. Also, you can safely run the transport client with Lucene 3.5 and not use Lucene 3.6.

On Aug 13, 2012, at 7:01 PM, Ivan Brusic ivan@brusic.com wrote:



(Jörg Prante) #5

Hi Ivan,

my recommendation is also to upgrade from 0.19.2 to a newer version,
because there were issues with TransportClient sniffing. For example,
see issue #1819.

Best regards,

Jörg

On Friday, August 10, 2012 8:38:46 PM UTC+2, Ivan Brusic wrote:



(Ivan Brusic) #6

Hopefully I would never encounter #1819 since running with no nodes is
not where I want to be!

Ran some stress tests for a few hours without triggering the issue.
Then, while running some other queries, I managed to get into the
erroneous state:

https://gist.github.com/88222848a398f813fdb0

The client timed out although it was executing queries at the time. I
converted the code to Lucene 3.6 (only one API change) and will test
later with 0.19.8.

Cheers,

Ivan

On Mon, Aug 13, 2012 at 4:20 PM, Jörg Prante joergprante@gmail.com wrote:



(Ivan Brusic) #7

Not only have I been able to replicate the issue, but the problem is
now consistent. We are now getting "RemoteTransportException ...
OutOfMemoryError" errors as well.

Here are the recent errors: https://gist.github.com/c6728e50b40a34a9c42a

The only relevant commit to the transport client that I can see is

I will upgrade to 0.19.8 today. I also just noticed another commit
(bdea0e2eddb4373b850e00d8e363c5240d78d180) that I hope gets released
soon as well (I wrote identical code, but prefer to use a standard
class whenever possible).

Cheers,

Ivan

On Mon, Aug 13, 2012 at 5:42 PM, Ivan Brusic ivan@brusic.com wrote:


