Cluster Split Brain

We have been having issues with our cluster split-braining a few times this week. Here are the error logs. Is there anything I can take a look at?

http://www.pastie.org/2527719
http://www.pastie.org/pastes/2527779

Can you check the logs and see why the nodes disconnect? You should see log
messages about nodes being disconnected.

On Tue, Sep 13, 2011 at 9:43 PM, phobos182 phobos182@gmail.com wrote:

We have been having issues with our cluster split-braining a few times this
week. Here are the error logs. Is there anything I can take a look at?

http://www.pastie.org/2527719


It happened again, and this time I have the log files.

192.168.200.110

http://www.pastie.org/pastes/2537779

192.168.200.109

http://www.pastie.org/pastes/2537721

Master

http://www.pastie.org/pastes/2537751
http://www.pastie.org/pastes/2537894

It looks like the master went offline and could not ping anybody else, or something to that effect, but it came back online shortly. Why would the cluster split-brain if one node went down (which happened to be the master)? I'm sure the cluster had quorum to route around a failed node.

It also looks like the master went haywire with cluster state updates during that time: the cluster state jumped from version 900 to 925 in less than a second.
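
If it helps, this is the kind of split-brain guard we're considering adding on our side; the value below is just an example for a cluster with three master-eligible nodes, not necessarily what we run:

  # elasticsearch.yml (sketch, assuming three master-eligible nodes)
  # Require a majority of master-eligible nodes to be present before a
  # master can be elected, so an isolated node cannot form its own cluster.
  discovery.zen.minimum_master_nodes: 2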

I don't see the log messages about the node being disconnected from the
cluster.

On Thu, Sep 15, 2011 at 6:04 PM, phobos182 phobos182@gmail.com wrote:

It happened again, and this time I have the log files.

http://www.pastie.org/pastes/2537721


I wanted to follow up on this.

It turns out that the CentOS 6 kernel version we were running had an issue with the bnx2 driver for the Broadcom NICs in our servers, which caused very short-lived but recurring outages on the interfaces.

We updated the kernel drivers, rebooted each node, and have not seen this error since. The tip-off was the error message in the logs saying a server could not be pinged for 20-30 seconds. Considering we have a very robust 480 Gbit/s network core with 10 Gbit LAGs, this pointed directly at the servers.
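
In case anyone else hits this, here is roughly how we checked the NIC driver and firmware versions before and after the update (eth0 is just an example interface name; substitute whichever interface carries the cluster traffic):

  # Show the driver name, driver version, and firmware version for the interface
  ethtool -i eth0
  # Show the bnx2 kernel module details (version, source file, parameters)
  modinfo bnx2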