Possible causes for 'transport disconnected' errors in node discovery?

On our live cluster we have recently encountered some connectivity issues.
Our cluster is spread across two physical data centres with a dedicated
private link between them that is 'fairly' reliable. From the various
comments I've read so far, though, a multi-data-centre cluster is not
currently advised, so I'm anticipating that kind of response.

Below is the log from one of our 'coordinator' nodes (i.e. data=false,
master=true). It shows that zen discovery failed for nodes in the other
data centre, with a reason of 'transport disconnected (with verified
connect)'. The monitoring data for our data centre link shows that no link
outage occurred in that time period, though there was some increase in
traffic across the link.

From the ES source code, and from other posts that support this reading, I
would expect a different message if the connectivity failure were caused
by the traffic increase: the ping request would have timed out (failed to
ping [{}], tried [{}] times, each with maximum [{}]).

So my question is this:

What are the likely causes of a 'transport disconnect' in this situation?

And is my expectation of an ES cluster to work over this architecture a
naive one?

(Note: I've changed some of the IP addresses for security reasons and
reformatted the log to aid readability.)


[2013-09-15 19:34:02,845][INFO ][cluster.service]
[live_SQLWOK11_coordinator]

removed
{
[live_SQLLIVE25][FDmaoEyDSni109A5nY_kcg][inet[/999.86.1.40:53028]]{datacentrename=London,
nodename=live_SQLLIVE25, master=false},
},

reason:
zen-disco-node_failed([live_SQLLIVE25][FDmaoEyDSni109A5nY_kcg][inet[/999.86.1.40:53028]]{datacentrename=London,
nodename=live_SQLLIVE25, master=false}),

reason transport disconnected (with verified connect)

[2013-09-15 19:34:03,140][INFO ][cluster.service]
[live_SQLWOK11_coordinator]

removed
{
[live_SQLLIVE24_loadbalancer][KgZ0hKRuRj6sIafa7eXQyA][inet[/999.86.1.38:54604]]{datacentrename=London,
data=false, nodename=live_SQLLIVE24_loadbalancer, master=false},
},

reason:
zen-disco-node_failed([live_SQLLIVE24_loadbalancer][KgZ0hKRuRj6sIafa7eXQyA][inet[/999.86.1.38:54604]]{datacentrename=London,
data=false, nodename=live_SQLLIVE24_loadbalancer, master=false}),

reason transport disconnected (with verified connect)

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Is it possible you had no traffic between the locations for some hours? If
so, ES needs TCP keepalive messages on its long-lived connections to keep
them open. Check your underlying OS TCP keepalive timeout (the default on
Linux is 7200 seconds) and lower it to, e.g., 600 seconds; this is the idle
time after which the first TCP keepalive message is sent. Also consider a
lower interval for the keepalive messages.
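
As a rough sketch of how to inspect these settings: on Linux the kernel
exposes the system-wide keepalive values under /proc/sys/net/ipv4/. The
helper names below are illustrative only and are not part of ES. The code
reads the three values and works out when the first probe is sent and how
long it takes before a silent peer is declared dead.

```python
from pathlib import Path

# Assumption: a Linux host, where the system-wide keepalive settings live here.
PROC_DIR = Path("/proc/sys/net/ipv4")


def read_keepalive_settings(proc_dir=PROC_DIR):
    """Return (time_s, interval_s, probes), or None if the files are unavailable."""
    names = ("tcp_keepalive_time", "tcp_keepalive_intvl", "tcp_keepalive_probes")
    try:
        return tuple(int((proc_dir / n).read_text().strip()) for n in names)
    except (OSError, ValueError):
        return None  # not Linux, or the files could not be read


def keepalive_timing(time_s, intvl_s, probes):
    """The first probe goes out after time_s of idleness; a silent peer is
    declared dead after `probes` unanswered probes sent intvl_s apart."""
    return {
        "first_probe_after_s": time_s,
        "dead_peer_detected_after_s": time_s + intvl_s * probes,
    }


if __name__ == "__main__":
    # Fall back to the usual Linux defaults (7200 s, 75 s, 9 probes).
    settings = read_keepalive_settings() or (7200, 75, 9)
    print(settings, keepalive_timing(*settings))
```

To lower the idle time for the whole machine, the usual route is
`sysctl -w net.ipv4.tcp_keepalive_time=600` run as root, plus an entry in
/etc/sysctl.conf to make it persistent.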

I found this hint at

https://groups.google.com/forum/#!msg/elasticsearch/c9JmaiVfBb0/9XZM6ZJpoBwJ

More info

http://tldp.org/HOWTO/TCP-Keepalive-HOWTO/usingkeepalive.html

Jörg



Thanks very much for your reply. I'm quickly increasing my knowledge of
tcp and keep alives.

Yes it is very likely there were no writes to those nodes for a long time,
though there were reads.

I will try reducing the tcp keep alive time on the nodes, though it seems a
shame to have to change these machine wide settings to keep the ES cluster
healthy during quiet periods - I would expect a product like ES to be more
resilient than this by default. And we have the added restriction that a
SAN connected to one of the servers requires specific tcp keep alive
settings.
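
For what it's worth, at the socket API level Linux lets a single connection
override the machine-wide keepalive timers, which would sidestep the
conflict with the SAN's settings. The sketch below only illustrates that
mechanism with Python's standard socket module; `enable_keepalive` is a
made-up helper, and whether ES exposes an equivalent per-connection option
is not something established in this thread.

```python
import socket


def enable_keepalive(sock, idle_s=600, interval_s=60, probes=3):
    """Turn on TCP keepalive for one socket and, where the platform
    supports it, override the system-wide timers for that socket only."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # TCP_KEEPIDLE / TCP_KEEPINTVL / TCP_KEEPCNT are the Linux option names;
    # skip any that this platform's socket module does not define.
    overrides = (
        ("TCP_KEEPIDLE", idle_s),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval_s),  # seconds between probes
        ("TCP_KEEPCNT", probes),        # unanswered probes before giving up
    )
    for name, value in overrides:
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
    return sock


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        enable_keepalive(s)
        print(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE))
```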

I'm still struggling to understand exactly what's going on here... ES is
sending a ping message every second, and doing it via the same transport
object on which the disconnect occurs (if my understanding of the code is
correct), and hence the same tcp connection. The ping message is
effectively data packets going across the tcp connection, removing the
importance/need for tcp keep alives to maintain the connection. So I'm
confused as to why the keep-alive is important. Looking closer at the
code, when the transport to a given node disconnects, it attempts to
establish the connection again once, which is what the "with verified
connect" part of the log message seems to refer to.
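
To make my reading of the two failure paths concrete, here is a toy model
in Python. It is not the ES implementation: the class, its method names and
the timeout value are my own illustration, and only the two message texts
are taken from the log and source quoted in this thread.

```python
from dataclasses import dataclass


@dataclass
class FaultDetector:
    """Toy model of the two failure paths; names and defaults are illustrative."""
    ping_retries: int = 3
    ping_timeout: str = "30s"  # hypothetical value, not taken from this thread
    failed_pings: int = 0

    def on_ping_result(self, ok):
        """Path 1: pings time out. The node fails only after
        ping_retries consecutive misses."""
        if ok:
            self.failed_pings = 0
            return None
        self.failed_pings += 1
        if self.failed_pings >= self.ping_retries:
            return "failed to ping, tried [%d] times, each with maximum [%s]" % (
                self.ping_retries, self.ping_timeout)
        return None

    def on_disconnect(self, reconnect):
        """Path 2: the connection itself was closed. Try to reconnect once;
        if that also fails, the node is reported as failed."""
        try:
            reconnect()
            return None
        except OSError:
            return "transport disconnected (with verified connect)"
```

In this model a slow or congested link can only ever produce the first
message, while the second requires the connection to be closed and a fresh
connect attempt to fail as well.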

(In reply to Jörg Prante's message of Tuesday, September 17, 2013,
8:49:15 PM UTC+1.)


I'm quite sure there is a firewall to get past between the "coordinator"
node and the other nodes. The message "transport disconnected (with
verified connect)" indicates the connection was declared invalid from
outside ES, and ES cannot do much to cope with that situation from inside.
The idea is that decreasing the Linux TCP keepalive timeout may persuade a
firewall not to take down a network connection once its idle timeout
expires. Even if pings are sent each second, 3 retries that fail for
whatever reason can be enough for ES to assume a disconnected channel. I
wonder if debug logs could show more info, like I/O exceptions.
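
The firewall reasoning can be sketched as a simple comparison. This is a
deliberately simplified model and an assumption on my part: it supposes the
firewall drops a connection once no packet has crossed it for its idle
timeout, and that keepalive probes are answered. The function names and
example numbers are hypothetical.

```python
def longest_silence_s(app_traffic_gap_s, keepalive_time_s=None):
    """Longest stretch, in seconds, during which no packet crosses the connection.
    With keepalive enabled, a probe is sent once the connection has been idle
    for keepalive_time_s, which caps the silence."""
    if keepalive_time_s is None:
        return app_traffic_gap_s
    return min(app_traffic_gap_s, keepalive_time_s)


def survives_firewall(firewall_idle_timeout_s, app_traffic_gap_s, keepalive_time_s=None):
    """True if the connection is never silent long enough for the firewall to drop it."""
    return longest_silence_s(app_traffic_gap_s, keepalive_time_s) < firewall_idle_timeout_s


if __name__ == "__main__":
    # Hypothetical: a 1-hour firewall idle timeout and a 4-hour quiet period.
    print(survives_firewall(3600, 4 * 3600, keepalive_time_s=7200))  # Linux default
    print(survives_firewall(3600, 4 * 3600, keepalive_time_s=600))   # lowered value
```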

Jörg
