Elasticsearch cluster split-brain and timeout issues

We are having a few challenges with an ES cluster: we have been suffering from some split-brain issues, and from client timeouts after roughly 2 hours 20 minutes.

  • This is on CentOS Linux; the cluster is spread across three servers (Builder, App1, App2).
  • We are using unicast, so we think there is no UDP requirement for the firewall?
  • We are using zen discovery.
  • We have set up TCP keepalive on all three boxes.
  • We have allowed TCP connectivity on our defined port, 10021, between all three servers.
  • We have three nodes, all of which are master-eligible (roughly as sketched in the config below):
      • Builder: master = yes, data = no
      • App1: master = yes, data = yes
      • App2: master = yes, data = yes
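
For reference, the discovery-related parts of our config look roughly like the sketch below. This is an illustrative elasticsearch.yml for one of the data nodes, not our literal file: the cluster name and addresses are placeholders, the setting names are the older (1.x-era) zen-discovery ones, and discovery.zen.minimum_master_nodes: 2 is the standard split-brain guard for three master-eligible nodes (quorum of 2) – we are not certain we have that set.

# elasticsearch.yml on App1 (illustrative sketch, placeholder names/addresses)
cluster.name: my-cluster
node.name: se-data-node-01
node.master: true
node.data: true

# Transport port we have opened on the firewall between all three boxes
transport.tcp.port: 10021

# Unicast-only zen discovery – list all three master-eligible nodes
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["BuilderServerIP:10021", "App1ServerIP:10021", "App2ServerIP:10021"]

# With three master-eligible nodes, a quorum of 2 prevents split brain
discovery.zen.minimum_master_nodes: 2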

Symptoms:

  • After a period of time, the data nodes (App1 or App2) decide that the master has disappeared, hold an election, and choose a new master. The reason for the master disappearing seems to be some kind of closing of a TCP session (see the log excerpt below). In the case below, se-data-node-02 loses touch with both se-node-master and se-data-node-01 and therefore can't complete an election.
  • Almost immediately they re-discover the original master. This happens quite a few times (see the failure-detection notes after this list).
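
Since the master_left reason is "transport disconnected (with verified connect)" rather than a ping timeout, the TCP session itself appears to be getting closed, and as far as we understand zen reports a verified disconnect immediately, without going through its ping retries. For completeness, these are the failure-detection and keepalive knobs we have been looking at on the ES side (older zen setting names; the values shown are the defaults, purely illustrative):

# Zen fault-detection pings between master and nodes (values shown are the defaults)
discovery.zen.fd.ping_interval: 1s
discovery.zen.fd.ping_timeout: 30s
discovery.zen.fd.ping_retries: 3

# Ask ES to set SO_KEEPALIVE on its transport sockets (already true by default),
# so the OS-level TCP keepalive tuning we did actually applies to these connections
network.tcp.keep_alive: true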

Example App2 log:
Mar 16, 2017 8:15:02 PM discovery.zen
INFO: [se-data-node-02] master_left [[se-node-master][C2GM0d21ROCEwg9xlgC2rQ][BuilderServer][inet[/BuilderServerIP:10021]]{data=false, master=true}], reason [transport disconnected (with verified connect)]
Mar 16, 2017 8:15:02 PM cluster.service
INFO: [se-data-node-02] master {new [se-data-node-01][E_FSeTN9TrWPnmN39TRPTg][App1Server][inet[/App1ServerIP:10021]]{master=true}, previous [se-node-master][C2GM0d21ROCEwg9xlgC2rQ][BuilderServer ][inet[/BuilderServerIP:10021]]{data=false, master=true}}, removed {[se-node-master][C2GM0d21ROCEwg9xlgC2rQ][BuilderServer ][inet[/BuilderServerIP:10021]]{data=false, master=true},}, reason: zen-disco-master_failed ([se-node-master][C2GM0d21ROCEwg9xlgC2rQ][BuilderServer ][inet[/BuilderServerIP:10021]]{data=false, master=true})
Mar 16, 2017 8:15:05 PM discovery.zen
INFO: [se-data-node-02] master_left [[se-data-node-01][E_FSeTN9TrWPnmN39TRPTg][App1Server][inet[/App1ServerIP:10021]]{master=true}], reason [no longer master]
Mar 16, 2017 8:15:05 PM discovery.zen
WARNING: [se-data-node-02] not enough master nodes after master left (reason = no longer master), current nodes: {[se-pres-webapp-App1Server][tETV0m3FQaGrM_83Eb9YVw][App1Server][inet[/App1ServerIP:10022]]{data=false, master=false},[se-data-node-02][zZS4guf0TPuQD9OvuPZt6A][App2Server][inet[/App2ServerIP:10021]]{master=true},[vis-export-tool-BuilderServer ][4huaVm6dTrqzkGXLWZpJLA][BuilderServer][inet[/BuilderServerIP:10022]]{client=true, data=false},[viz-webapp-App1Server][H-slsFOvRNqYErdD3v8rLQ][App1Server][inet[/App1ServerIP:10023]]{client=true, data=false, master=false},}
Mar 16, 2017 8:15:05 PM cluster.service
Mar 16, 2017 8:15:15 PM cluster.service
INFO: [se-data-node-02] detected_master [se-node-master][C2GM0d21ROCEwg9xlgC2rQ……

The client node connects to the cluster, but repeatedly gets a timeout exception after about two hours:
INFO: [SysErr]: 22:05:56,107 ERROR [pool-1-thread-1:ImportBrokerLoggingFacade:51]: Error Occurred on Import Broker during Import request.
INFO: [SysErr]: org.elasticsearch.ElasticsearchTimeoutException: Timeout waiting for task.
INFO: [SysErr]: at org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:74)

Wondering if anyone has thoughts on this - where should we start looking? Thanks!

What version?

Also, the regular timing (roughly every 2 hours 20 minutes) would suggest a firewall with a session limit dropping the connections.
