Cluster Split Brain

We have been having issues with our cluster split-braining a few times this week. Here are the error logs. Is there anything I can take a look at?

http://www.pastie.org/2527719
http://www.pastie.org/pastes/2527779

Can you check the logs and see why the nodes disconnect? You should see log
messages about nodes being disconnected.

On Tue, Sep 13, 2011 at 9:43 PM, phobos182 phobos182@gmail.com wrote:

We have been having issues with our cluster split-braining a few times this
week. Here are the error logs. Is there anything I can take a look at?

http://www.pastie.org/2527719


It happened again, and this time I have the log files.

192.168.200.110

http://www.pastie.org/pastes/2537779

192.168.200.109

http://www.pastie.org/pastes/2537721

Master

http://www.pastie.org/pastes/2537751
http://www.pastie.org/pastes/2537894

It looks like the master went offline and could not ping anybody else, or something to that effect, but it came back online shortly. Why would the cluster split-brain if one node went down (which happened to be the master)? I'm sure the cluster had quorum to route around a failed node.

It also looks like the master went haywire with cluster state updates during that time: the cluster state jumped from version 900 to 925 in less than a second.
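
If it helps, this is the kind of split-brain guard we're considering adding on our side; the value below is just an example for a cluster with three master-eligible nodes, not necessarily what we run:

  # elasticsearch.yml (sketch, assuming three master-eligible nodes)
  # Require a majority of master-eligible nodes to be present before a
  # master can be elected, so an isolated node cannot form its own cluster.
  discovery.zen.minimum_master_nodes: 2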

I don't see the log messages about the node being disconnected from the
cluster.

On Thu, Sep 15, 2011 at 6:04 PM, phobos182 phobos182@gmail.com wrote:

It happened again, and this time I have the log files.

http://www.pastie.org/pastes/2537721


I wanted to follow up on this.

It turns out that the CentOS 6 kernel version we were running had an issue with the bnx2 driver for the Broadcom NICs in our servers, which caused very short-lived but recurring outages on the interfaces.

We updated the kernel drivers, rebooted each node, and have not seen this error since. The tip-off was the error message in the logs saying a server could not be pinged for 20-30 seconds. Considering we have a very robust 480 Gbit/s network core with 10 Gbit LAGs, this pointed directly at the servers.
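
In case anyone else hits this, here is roughly how we checked the NIC driver and firmware versions before and after the update (eth0 is just an example interface name; substitute whichever interface carries the cluster traffic):

  # Show the driver name, driver version, and firmware version for the interface
  ethtool -i eth0
  # Show the bnx2 kernel module details (version, source file, parameters)
  modinfo bnx2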