We are having a few challenges with an ES cluster: we have been suffering from some split-brain issues, and some client time-outs after roughly 2h 20min.
- This is on CentOS Linux; the cluster spans three servers (Builder, App1, App2)
- We are using unicast, so we think there is no UDP requirement for the firewall?
- We are using zen discovery (a rough sketch of our discovery config follows this list)
- We have set up TCP keepalive on all three boxes
- We have allowed TCP connectivity on our defined ports (10021) between all three servers
- We have three nodes, all of which are master-eligible:
  - Builder: master = yes, data = no
  - App1: master = yes, data = yes
  - App2: master = yes, data = yes
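For reference, the discovery side of elasticsearch.yml on the data nodes looks roughly like this (a sketch from memory: the host names are the same placeholders as in the logs below, and the exact values, particularly minimum_master_nodes, still need to be verified against the live config):

    node.master: true
    node.data: true                                # Builder has node.data: false
    transport.tcp.port: 10021
    discovery.zen.ping.multicast.enabled: false    # unicast only, hence no UDP rules
    discovery.zen.ping.unicast.hosts: ["BuilderServerIP:10021", "App1ServerIP:10021", "App2ServerIP:10021"]
    discovery.zen.minimum_master_nodes: 2          # quorum for three master-eligible nodes, value to be confirmed

If minimum_master_nodes turns out not to be 2 on every node, we realise that on its own could allow split brain, so that is one of the things we want to rule out.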
Symptoms:
- After a period of time, the data nodes (App1 or App2) think that the master has disappeared, hold an election, and choose a new master. The reason for the master disappearing seems to be some kind of closing of a TCP session (see the "transport disconnected" reason in the log excerpt below, and the fault-detection settings listed after it). In the case below, se-data-node-02 loses touch with both se-node-master and se-data-node-01 and therefore cannot complete an election.
- Almost immediately they re-discover the original master. This happens quite a few times.
Example App2 log (se-data-node-02):
Mar 16, 2017 8:15:02 PM discovery.zen
INFO: [se-data-node-02] master_left [[se-node-master][C2GM0d21ROCEwg9xlgC2rQ][BuilderServer][inet[/BuilderServerIP:10021]]{data=false, master=true}], reason [transport disconnected (with verified connect)]
Mar 16, 2017 8:15:02 PM cluster.service
INFO: [se-data-node-02] master {new [se-data-node-01][E_FSeTN9TrWPnmN39TRPTg][App1Server][inet[/App1ServerIP:10021]]{master=true}, previous [se-node-master][C2GM0d21ROCEwg9xlgC2rQ][BuilderServer ][inet[/BuilderServerIP:10021]]{data=false, master=true}}, removed {[se-node-master][C2GM0d21ROCEwg9xlgC2rQ][BuilderServer ][inet[/BuilderServerIP:10021]]{data=false, master=true},}, reason: zen-disco-master_failed ([se-node-master][C2GM0d21ROCEwg9xlgC2rQ][BuilderServer ][inet[/BuilderServerIP:10021]]{data=false, master=true})
Mar 16, 2017 8:15:05 PM discovery.zen
INFO: [se-data-node-02] master_left [[se-data-node-01][E_FSeTN9TrWPnmN39TRPTg][App1Server][inet[/App1ServerIP:10021]]{master=true}], reason [no longer master]
Mar 16, 2017 8:15:05 PM discovery.zen
WARNING: [se-data-node-02] not enough master nodes after master left (reason = no longer master), current nodes: {[se-pres-webapp-App1Server][tETV0m3FQaGrM_83Eb9YVw][App1Server][inet[/App1ServerIP:10022]]{data=false, master=false},[se-data-node-02][zZS4guf0TPuQD9OvuPZt6A][App2Server][inet[/App2ServerIP:10021]]{master=true},[vis-export-tool-BuilderServer ][4huaVm6dTrqzkGXLWZpJLA][BuilderServer][inet[/BuilderServerIP:10022]]{client=true, data=false},[viz-webapp-App1Server][H-slsFOvRNqYErdD3v8rLQ][App1Server][inet[/App1ServerIP:10023]]{client=true, data=false, master=false},}
Mar 16, 2017 8:15:05 PM cluster.service
Mar 16, 2017 8:15:15 PM cluster.service
INFO: [se-data-node-02] detected_master [se-node-master][C2GM0d21ROCEwg9xlgC2rQ……
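For completeness: as far as we know we have not changed any of the fault-detection or ES-level keepalive settings, so we assume they are still at the values below (the TCP keepalive mentioned earlier was configured at the OS level, separately from this):

    network.tcp.keep_alive: true          # not set explicitly by us; we believe this is the default
    discovery.zen.fd.ping_interval: 1s    # default
    discovery.zen.fd.ping_timeout: 30s    # default
    discovery.zen.fd.ping_retries: 3      # default

We list these because the master_left reason above is a transport disconnect ("transport disconnected (with verified connect)") rather than a ping timeout, which is what makes us suspect something is closing the TCP session underneath us.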
The client node connects to the cluster but repeatedly gets a timeout exception after two hours (its settings are sketched after the stack trace below):
INFO: [SysErr]: 22:05:56,107 ERROR [pool-1-thread-1:ImportBrokerLoggingFacade:51]: Error Occurred on Import Broker during Import request.
INFO: [SysErr]: org.elasticsearch.ElasticsearchTimeoutException: Timeout waiting for task.
INFO: [SysErr]: at org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:74)
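The timing-out client joins the cluster as a client node (like the webapp/export-tool nodes visible in the cluster-state log further up); its settings are roughly the following sketch, with the same placeholder host names and the 10022/10023 ports seen in the logs:

    node.client: true        # i.e. node.data: false, node.master: false
    transport.tcp.port: 10022
    discovery.zen.ping.multicast.enabled: false
    discovery.zen.ping.unicast.hosts: ["BuilderServerIP:10021", "App1ServerIP:10021", "App2ServerIP:10021"]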
Wondering if anyone has thoughts on this - where should we start looking? Thanks!