Hi,
I was running a test with Elasticsearch under heavy indexing and left the
system running overnight. The setup is two ES data nodes in a cluster and
one TransportClient that is shared between threads in a web application.
The web application does the heavy indexing.
In the morning I saw that the indexing had finished successfully, but when
I tried to run queries, or even check the status through the AdminClient
interface of that same single TransportClient instance (still the same
instance used for the indexing), I always got a "... disconnected"
exception. The ES data nodes in the cluster were working perfectly. I had
to restart the web app (so the TransportClient was recreated) to be able to
connect to the data nodes again and run queries and so on.
I was really surprised that the TransportClient didn't try to recreate the
connection to recover its own state. If you plan to use a singleton
TransportClient shared between requests and only close it when the whole
app shuts down, that kind of self-recovery should be a must in TransportClient.
Am I missing something in the TransportClient configuration, or is there
another client variant that handles reconnection automatically?
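For reference, here is a minimal sketch of the setup described above (0.90-era API; the cluster name, node addresses, and class name are placeholders): one TransportClient built at application startup, shared by all request threads, and closed only at shutdown.

import org.elasticsearch.client.Client;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

// One TransportClient for the whole web application. TransportClient is
// thread-safe, so all request threads can share this single instance.
public final class EsClientHolder {
    private static final TransportClient CLIENT = new TransportClient(
            ImmutableSettings.settingsBuilder()
                    .put("cluster.name", "my-cluster")   // placeholder
                    .build());

    static {
        CLIENT.addTransportAddress(new InetSocketTransportAddress("data-node-1", 9300));
        CLIENT.addTransportAddress(new InetSocketTransportAddress("data-node-2", 9300));
    }

    private EsClientHolder() {}

    public static Client get() {
        return CLIENT;
    }

    // Called once when the web application shuts down.
    public static void shutdown() {
        CLIENT.close();
    }
}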
I've never seen this issue, but your comment caught my attention.
From the server perspective, Netty (used by ES) has a default read timeout
after which it closes its side of the connection. Since the reader channel
and the writer (response) channel are different, the read timeout doesn't
know if the client isn't sending a request because:
1. The connection was taken down silently (firewall, laptop hibernate,
network plug pulled out of the wall, and so on).
2. The client is just idle.
3. The channel actually working on the request is taking a long time,
and since the client is waiting for that response before sending its next
request, it's not really idle.

Netty cannot tell the difference between 1, 2, and 3, so its default is
long enough to allow most slow requests to be handled OK, but still short
enough to keep useless half-open sockets from leaking over time.
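As an illustration of the mechanism only (this is not how ES wires its pipeline internally, and the 30-second value is just a placeholder), a Netty 3 read timeout looks roughly like this:

import org.jboss.netty.channel.ChannelPipeline;
import org.jboss.netty.channel.ChannelPipelineFactory;
import org.jboss.netty.channel.Channels;
import org.jboss.netty.handler.timeout.ReadTimeoutHandler;
import org.jboss.netty.util.HashedWheelTimer;
import org.jboss.netty.util.Timer;

// Sketch only: a server pipeline that raises a ReadTimeoutException when no
// bytes arrive from the client within 30 seconds; the server typically reacts
// by closing its side of the channel.
public class ReadTimeoutPipelineFactory implements ChannelPipelineFactory {
    private final Timer timer = new HashedWheelTimer();

    public ChannelPipeline getPipeline() throws Exception {
        ChannelPipeline pipeline = Channels.pipeline();
        pipeline.addLast("readTimeout", new ReadTimeoutHandler(timer, 30));
        // ... the real request decoder and handler would follow here ...
        return pipeline;
    }
}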
From the client perspective, it likely has its own timeout, and if that
timeout fires it usually closes the connection. A persistent connection
requires close synchronization between client and server, and when the
client's timeout fires it cannot tell the difference between 1, 2, and 3
above either, so it usually just closes the connection and re-opens one
when it's needed again.
I haven't dived deeply into this with the TransportClient, but it seems to
behave properly in cases where the connection is closed due to being idle.
And since the client can't tell the difference between the possible reasons
for a timeout, it would seem that graceful recovery from an idle timeout
would make it recover just as gracefully as when a firewall silently takes
down an idle connection.
But in this case, the trick would be to ensure that the Netty session
timeout and the client's TransportClient timeout were both shorter than the
firewall idle session timeout. Then the client won't ever try to use a
silently closed connection.
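On the client side, the knobs I'm aware of for this (0.90-era settings; the values below are placeholders) are the ping timeout and the node sampler interval, both of which should stay well below the firewall's idle-session timeout:

import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.settings.Settings;

public final class ClientTimeoutSettings {
    // Client-side timeouts only; the server-side Netty timeout is configured
    // separately on the data nodes. The point is to keep both well under the
    // shortest firewall idle-session timeout on the network path.
    public static Settings build() {
        return ImmutableSettings.settingsBuilder()
                .put("client.transport.ping_timeout", "5s")           // how long to wait for a ping reply
                .put("client.transport.nodes_sampler_interval", "5s") // how often the connected nodes are checked
                .build();
    }
}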
I also use a singleton TransportClient instance (and, unfortunately, have
many dozens of connections) and have never been disconnected because of
long-idle connections (in 0.19.11 and 0.90.2).
As far as I understand, the TransportClient sends a small 'ping' packet to
the cluster every 5 seconds to stay aware of node faults.
If the connection was closed silently, neither the server nor the client
can become aware of the unusable connection except by sending a byte over
the wire. The "disconnected" message will then appear, but in that case it
is just a warning, and the client should reconnect to the cluster - with
'sniff' mode enabled, even to another node.
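For what it's worth, a minimal sketch of enabling that sniff mode when building the client settings (0.90-era API; the cluster name is a placeholder):

import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.settings.Settings;

public final class SniffSettings {
    // With sniffing enabled the TransportClient discovers the other cluster
    // nodes from the ones it was pointed at, so after a dropped connection a
    // reconnect can land on a different node.
    public static Settings build() {
        return ImmutableSettings.settingsBuilder()
                .put("cluster.name", "my-cluster")     // placeholder
                .put("client.transport.sniff", true)
                .build();
    }
}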
http://blog.stephencleary.com/2009/05/detection-of-half-open-dropped.html is an excellent blog entry that describes the detection of half-open
(dropped) connections. In particular, the "Explicit timer assuming the
worst" recommendation should be implemented. I've done this with C++
servers and clients, and with Java servers and clients. It doesn't actually
detect half-open connections, but it virtually eliminates client issues and
completely eliminates server resource leaks.
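One way to apply that "explicit timer assuming the worst" idea to a long-lived ES client, as a sketch only (the interval and class name are made up; pick an interval shorter than the shortest idle timeout on the path):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.elasticsearch.client.Client;

// Assume the worst: treat any connection idle for longer than the firewall's
// timeout as dead, and touch it before that deadline so it never gets there.
public class ClientKeepAlive {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    public void start(final Client client) {
        scheduler.scheduleAtFixedRate(new Runnable() {
            public void run() {
                try {
                    // A cheap round trip keeps the connection from going idle and
                    // surfaces a dead socket here instead of at query time.
                    client.admin().cluster().prepareHealth().execute().actionGet();
                } catch (Exception e) {
                    // This is the early warning: log it and let the application
                    // rebuild the client if it keeps failing.
                }
            }
        }, 60, 60, TimeUnit.SECONDS);
    }

    public void stop() {
        scheduler.shutdown();
    }
}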
Interesting theory, Brian. I think this is actually happening to me, as I'm
seeing a slow increase in the number of nodes in my cluster even though I'm
not adding clients to the ES cluster.
Hi,
There is no firewall, but there is certainly some network instability at times.
I got around this programmatically by updating an atomic reference to the
TransportClient with a brand-new instance whenever that specific exception,
NodeDisconnectedException, is thrown.
Thanks!
Regards.
On Friday, September 27, 2013 10:51:42 AM UTC-7, Jörg Prante wrote:
Are there firewalls or something like that between TransportClient and
cluster?
If they take connections down silently, it is often not possible to detect
this condition properly.