We are having a few challenges with an ES cluster: we have been suffering from some split-brain issues, and some client time-outs after roughly 2h 20min.
- This is on CentOS Linux; the cluster spans three servers (Builder, App1, App2)
- We are using unicast, so we think there is no UDP requirement for the firewall?
- We are using zen discovery (a rough sketch of our discovery config follows this list)
- We have set up TCP keepalive on all three boxes
- We have allowed TCP connectivity on our defined ports (10021) between all three servers
- We have three nodes, all of which are master-eligible:
  - Builder: master = yes, data = no
  - App1: master = yes, data = yes
  - App2: master = yes, data = yes
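For reference, the discovery side of elasticsearch.yml on the data nodes looks roughly like this (a sketch from memory: the host names are the same placeholders as in the logs below, and the exact values, particularly minimum_master_nodes, still need to be verified against the live config):

    node.master: true
    node.data: true                                # Builder has node.data: false
    transport.tcp.port: 10021
    discovery.zen.ping.multicast.enabled: false    # unicast only, hence no UDP rules
    discovery.zen.ping.unicast.hosts: ["BuilderServerIP:10021", "App1ServerIP:10021", "App2ServerIP:10021"]
    discovery.zen.minimum_master_nodes: 2          # quorum for three master-eligible nodes, value to be confirmed

If minimum_master_nodes turns out not to be 2 on every node, we realise that on its own could allow split brain, so that is one of the things we want to rule out.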
Symptoms:
- After a period of time, the data nodes (App1 or App2) think that the master has disappeared, hold an election, and choose a new master. The reason for the master disappearing seems to be some kind of closing of a TCP session (see the "transport disconnected" reason in the log excerpt below, and the fault-detection settings listed after it). In the case below, se-data-node-02 loses touch with both se-node-master and se-data-node-01 and therefore cannot complete an election.
- Almost immediately they re-discover the original master. This happens quite a few times.
Example App2 log (se-data-node-02):
Mar 16, 2017 8:15:02 PM discovery.zen
INFO: [se-data-node-02] master_left [[se-node-master][C2GM0d21ROCEwg9xlgC2rQ][BuilderServer][inet[/BuilderServerIP:10021]]{data=false, master=true}], reason [transport disconnected (with verified connect)]
Mar 16, 2017 8:15:02 PM cluster.service
INFO: [se-data-node-02] master {new [se-data-node-01][E_FSeTN9TrWPnmN39TRPTg][App1Server][inet[/App1ServerIP:10021]]{master=true}, previous [se-node-master][C2GM0d21ROCEwg9xlgC2rQ][BuilderServer ][inet[/BuilderServerIP:10021]]{data=false, master=true}}, removed {[se-node-master][C2GM0d21ROCEwg9xlgC2rQ][BuilderServer ][inet[/BuilderServerIP:10021]]{data=false, master=true},}, reason: zen-disco-master_failed ([se-node-master][C2GM0d21ROCEwg9xlgC2rQ][BuilderServer ][inet[/BuilderServerIP:10021]]{data=false, master=true})
Mar 16, 2017 8:15:05 PM discovery.zen
INFO: [se-data-node-02] master_left [[se-data-node-01][E_FSeTN9TrWPnmN39TRPTg][App1Server][inet[/App1ServerIP:10021]]{master=true}], reason [no longer master]
Mar 16, 2017 8:15:05 PM discovery.zen
WARNING: [se-data-node-02] not enough master nodes after master left (reason = no longer master), current nodes: {[se-pres-webapp-App1Server][tETV0m3FQaGrM_83Eb9YVw][App1Server][inet[/App1ServerIP:10022]]{data=false, master=false},[se-data-node-02][zZS4guf0TPuQD9OvuPZt6A][App2Server][inet[/App2ServerIP:10021]]{master=true},[vis-export-tool-BuilderServer ][4huaVm6dTrqzkGXLWZpJLA][BuilderServer][inet[/BuilderServerIP:10022]]{client=true, data=false},[viz-webapp-App1Server][H-slsFOvRNqYErdD3v8rLQ][App1Server][inet[/App1ServerIP:10023]]{client=true, data=false, master=false},}
Mar 16, 2017 8:15:05 PM cluster.service
Mar 16, 2017 8:15:15 PM cluster.service
INFO: [se-data-node-02] detected_master [se-node-master][C2GM0d21ROCEwg9xlgC2rQ……
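For completeness: as far as we know we have not changed any of the fault-detection or ES-level keepalive settings, so we assume they are still at the values below (the TCP keepalive mentioned earlier was configured at the OS level, separately from this):

    network.tcp.keep_alive: true          # not set explicitly by us; we believe this is the default
    discovery.zen.fd.ping_interval: 1s    # default
    discovery.zen.fd.ping_timeout: 30s    # default
    discovery.zen.fd.ping_retries: 3      # default

We list these because the master_left reason above is a transport disconnect ("transport disconnected (with verified connect)") rather than a ping timeout, which is what makes us suspect something is closing the TCP session underneath us.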
The client node connects to the cluster but repeatedly gets a timeout exception after two hours (its settings are sketched after the stack trace below):
INFO: [SysErr]: 22:05:56,107 ERROR [pool-1-thread-1:ImportBrokerLoggingFacade:51]: Error Occurred on Import Broker during Import request.
INFO: [SysErr]: org.elasticsearch.ElasticsearchTimeoutException: Timeout waiting for task.
INFO: [SysErr]: at org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:74)
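The timing-out client joins the cluster as a client node (like the webapp/export-tool nodes visible in the cluster-state log further up); its settings are roughly the following sketch, with the same placeholder host names and the 10022/10023 ports seen in the logs:

    node.client: true        # i.e. node.data: false, node.master: false
    transport.tcp.port: 10022
    discovery.zen.ping.multicast.enabled: false
    discovery.zen.ping.unicast.hosts: ["BuilderServerIP:10021", "App1ServerIP:10021", "App2ServerIP:10021"]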
Wondering if anyone has thoughts on this - where should we start looking? Thanks!