Sequence numbers for write ops - split brain scenario

Continuing the discussion from the GitHub issue.

The current strategy, which seq# will keep enforcing but in an easier/faster way, is that all replicas are "reset" to be an exact copy of the primary currently chosen by the master.

The situation will be resolved when the network restores and the leftover primary and replica sync. I would like to clearly understand the reset/sync scenarios: what triggers a reset/sync?

I can think of a couple of "normal" operation scenarios:

  1. I would think that whenever a node joins a network, the master would initiate a sync/reset.
  2. If a replica fails a request, I suppose the primary should keep attempting a sync/reset; otherwise the replica might keep diverging. At some point the master has to decommission that replica, or reads would be inconsistent.

In the case of a split brain with replicas across networks (assuming min_master_nodes is set), primary-1 has been assuming that this replica R (on a third node, say N-3) has been failing (because of its allegiance to primary-2) but is still in the network. Hence it would attempt a sync/reset. How does this protocol work? Should master-1 attempt to decommission R at some point, going by assumption (2)?

This problem will occur in a loop if R is decommissioned but another replica is installed on N-3 in its place by the same protocol. There will be contention on N-3, with both masters trying to "reset" replica shards.

I suppose one way to resolve this is by letting a node choose a master when there are multiple masters. If we did this, then whenever a node loses its master, it would choose the other master, there would be a sync/reset, and all would be well.

However, if the node chooses its master, the other master will lose quorum and hence cease to exist, which in my opinion is a good resolution for this issue.

In my personal opinion, sequence numbers for shard synchronization are cool, but not enough.

The primary-to-replica transmission is just a part of the whole.

ES docs, when being indexed, should be seen as events in space-time. ES clients, ES nodes, and all involved shards have different clocks - we imagine a perfect world, but reality is messier: they may not agree about the order of the docs they see. The underlying OSes also have different clocks across the distributed hardware. It is a simplification to assume that all ES processes run on the same flow of events and are naturally synchronized by a global clock - they are not.

Unexpected catastrophic events triggered from outside (network disconnections, a slow JVM) reveal that all distributed ES components somehow have to synchronize. But that is not just when failures happen: they also have to synchronize when everything works normally - that work is just so small that it does not come to our attention.

Without attaching clocks to ES documents/events as they travel, verification/synchronization will be tedious. All the event information should go into the doc's context: client ID (sender) plus local clock, primary node ID (receiver 1) plus local clock, and secondary node IDs (receivers 2..n) plus their local clocks. Document versioning, operation sequence numbers, write consistency - it all boils down to the same challenge: processing docs in a distributed event stream with synchronized clocks, i.e. all nodes agreeing about the order of events they see.
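
To make this concrete, here is a minimal sketch (my own illustration, not anything inside ES) of a Lamport logical clock - the textbook way to get distributed processes to agree on event ordering without a shared wall clock:

```java
// Sketch of a Lamport logical clock: each process keeps a counter that it
// ticks on local events and merges with timestamps it receives, so that
// "happened-before" relationships are preserved across nodes.
public class LamportClock {
    private long time = 0;

    // Local event (e.g. a client stamping a doc before sending it).
    synchronized long tick() {
        return ++time;
    }

    // Receiving a message stamped by another node: jump past its clock.
    synchronized long receive(long remoteTime) {
        time = Math.max(time, remoteTime) + 1;
        return time;
    }

    public static void main(String[] args) {
        LamportClock client = new LamportClock();
        LamportClock primary = new LamportClock();
        long sent = client.tick();          // client stamps the doc: 1
        long seen = primary.receive(sent);  // primary orders it after: 2
        System.out.println(sent + " -> " + seen);
    }
}
```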

@rkonda I'm not following you exactly, but I think this is where the assumptions go wrong:

With min_master_nodes configured, a split brain shouldn't be possible. If it happens anyway, that is the bug and that is what we fix. As I said before, a split brain violates so many assumptions that many things go wrong.
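
To illustrate the arithmetic behind that setting (a sketch, not the actual discovery code): min_master_nodes should be a strict majority of the master-eligible nodes, so that at most one side of a network split can hold a quorum.

```java
// Sketch (not Elasticsearch source): the usual majority rule behind
// discovery.zen.minimum_master_nodes.
public class QuorumExample {
    static int minimumMasterNodes(int masterEligibleNodes) {
        return (masterEligibleNodes / 2) + 1; // strict majority
    }

    public static void main(String[] args) {
        // With 3 master-eligible nodes, quorum is 2: after a split into
        // {2 nodes} and {1 node}, only the larger side can elect a master.
        System.out.println(minimumMasterNodes(3)); // 2
        System.out.println(minimumMasterNodes(5)); // 3
    }
}
```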

This is exactly what happens - a node can only follow one master at a time and will only switch master if it has verified on its own that its current master is dead. To become a master, a node needs min_master_nodes joins (i.e. votes). See https://www.elastic.co/elasticon/conf/2016/sf/elasticsearch-and-resiliency for some (but not all :slight_smile:) more details.
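
A toy version of that election rule (not the real Zen discovery implementation, just the shape of it): a candidate only accepts its own election once it has collected a quorum of joins.

```java
import java.util.Set;

// Sketch: a candidate master counts join requests (including its own vote)
// and only becomes master once it reaches min_master_nodes.
public class ElectionSketch {
    static boolean electedAsMaster(Set<String> joinVotes, int minMasterNodes) {
        return joinVotes.size() >= minMasterNodes;
    }

    public static void main(String[] args) {
        System.out.println(electedAsMaster(Set.of("n1", "n2"), 2)); // true
        System.out.println(electedAsMaster(Set.of("n1"), 2));      // false: keep waiting
    }
}
```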

This is exactly what seq# are - a virtual clock of operation ordering on the primary.
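
As a rough sketch of that virtual clock (illustrative only, not the actual engine code): the primary stamps each write with a monotonically increasing counter, which yields a total order of operations on that shard without involving wall-clock time.

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch: the primary assigns every operation a unique, increasing
// sequence number before replicating it, defining the operation order.
public class PrimarySketch {
    private final AtomicLong seqNo = new AtomicLong(-1);

    long indexOperation(String docId) {
        long seq = seqNo.incrementAndGet(); // per-primary monotonic counter
        // ... apply the write locally, then replicate (docId, seq) to replicas
        return seq;
    }

    public static void main(String[] args) {
        PrimarySketch primary = new PrimarySketch();
        System.out.println(primary.indexOperation("doc-1")); // 0
        System.out.println(primary.indexOperation("doc-2")); // 1
    }
}
```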

That depends on the nature of the distributed system. In ES we require that all replicas will eventually expose the latest version of each doc. Replicas process requests out of order all the time. We currently use the doc versions to make sure old doesn't override new. In the future we are planning to move to seq#.
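
A minimal sketch of that version check (my illustration, not the real replica engine): an incoming operation is applied only if its version is newer than what the replica already holds, so out-of-order delivery can never let an old write clobber a newer one.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: per-doc version map on a replica; stale operations are dropped.
public class ReplicaSketch {
    private final Map<String, Long> docVersions = new HashMap<>();

    boolean apply(String docId, long version) {
        long current = docVersions.getOrDefault(docId, 0L);
        if (version <= current) {
            return false; // older than what we have: ignore
        }
        docVersions.put(docId, version);
        return true;
    }

    public static void main(String[] args) {
        ReplicaSketch replica = new ReplicaSketch();
        System.out.println(replica.apply("doc-1", 2)); // true: newer
        System.out.println(replica.apply("doc-1", 1)); // false: arrived late, dropped
    }
}
```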

@bleskes thanks for the reply.

https://github.com/elastic/elasticsearch/issues/2488 talks about a split-brain scenario even when minimum_master_nodes is set. Is that issue resolved? If it is still an issue, perhaps this protocol isn't working right or there is a bug.

Yes, that issue was closed and the fix is included in Elasticsearch 1.4 onwards.