Idle TCP session timeout leads to "transport disconnected" error and data loss

(Sergey Romanovsky) #1

I have a multi-datacenter ES cluster. In our network, idle WAN TCP sessions are closed after a timeout (currently 60 minutes).
Strangely, ES does not seem to tolerate this. Every 60 minutes the nodes log "master left (reason = transport disconnected)". Several seconds later they detect the master again and rejoin the cluster.
I expected ES to send keepalive packets to keep the session alive, and to try to re-establish a broken connection to the master before leaving the cluster. That does not seem to be the case.

Another problem:
If you are indexing data at the moment the nodes disconnect, you lose some of it (about 3 minutes' worth), even though every document was acknowledged as accepted. All documents were submitted one by one, without the bulk API.

Could you help me find the cause of the problem?

(Mark Walkom) #2

ES does send "ping" packets to make sure nodes are alive, so chances are this is something your WAN hardware is doing.

But ES was never built to handle cross-DC clusters, and it's something we don't support.
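
For readers unfamiliar with the keepalive mechanism raised in the question, here is a minimal Python sketch of OS-level TCP keepalive on a plain socket. It is a generic illustration, not Elasticsearch code, and it is not a confirmed fix for the disconnects described in this thread. The helper name `enable_keepalive` and the timer values (first probe after 300 seconds idle, well below the 60-minute timeout) are assumptions chosen for the example. The timer options are platform-dependent, so the sketch sets only the ones the local `socket` module exposes.

```python
import socket


def enable_keepalive(sock, idle_s=300, interval_s=60, probes=5):
    """Enable TCP keepalive on sock and return the options that were set.

    enable_keepalive and its default values are hypothetical, for illustration.
    """
    # SO_KEEPALIVE asks the OS to probe the peer when the connection is idle.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    applied = {"SO_KEEPALIVE": 1}

    # Per-socket timers are not available on every platform, so each one
    # is applied only if the socket module defines the constant.
    timers = (
        ("TCP_KEEPIDLE", idle_s),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval_s),  # seconds between probes
        ("TCP_KEEPCNT", probes),        # failed probes before the drop
    )
    for name, value in timers:
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
            applied[name] = value
    return applied


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        print(enable_keepalive(s))
    finally:
        s.close()
```

Whether probes like these keep a session open depends on whether the WAN device counts them as traffic, which would need to be checked against the hardware in question.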
