Maan, a discussion like this should be had over a beer and not over email;
it's very hard to convey all the different aspects, and trying to write short
answers just misses the point, but I will try and write something...
What I meant when I said you can't get around split brains was actually that
you can't avoid network partitions; the question is what you do when one
happens, and whether split brains can be avoided or not. One of the things I
don't like (even in respectable projects) is the "assertions" that this
product does this or that, especially since CAP relates to a point in time...
The first thing to note is that a search engine is very different from a
key-value or column-based storage system. A typical search query hits a huge
amount of data, and you want to keep scoring correct across all of it without
needing to "repair" each document on every search.
Another point is to differentiate between what elasticsearch does now and
what it is designed to do in its final version. It is certainly architected
in a way to solve most things, and to behave based on what the user decided to
give up on, but it's simply not implemented yet... (I will point out what is
not implemented as I go.)
So, one of the main problems when a network partition happens is that you
can't know whether the node that got partitioned is down or just not
reachable, and the question is how you handle it. It gets even worse if the
node that was partitioned got partitioned together with clients connected to
it, which keep on working against it. It gets even more interesting in
systems that reallocate shards in the event of node failure (or a node
getting partitioned out), as elasticsearch does.
The main problem with the above is that clients (the ones that got
partitioned with the offending node(s)) might still be sending data to those
nodes, in which case you have two options. The first is to not allow them to
make changes (searches might still be ok), and the other is to allow them to
write and reconcile once the network partition is resolved.
Let's start with not allowing writes. For that, you need to identify when
writes will not be allowed. This is much simpler to do when you have
fixed-location master(s), since if you lose the connection to it (or to a
quorum of them), you basically go into this mode. In elasticsearch there is
no fixed-location master out of the box, so the other option is to do some
sort of quorum across all the nodes in the cluster. That's not simple, mainly
because an elasticsearch cluster might be reduced to a smaller size
intentionally. One simple option to solve this is to have the user define a
"minimum" size of cluster that should be up, and if it's not, the nodes will
not allow writes. This will be in elasticsearch, and it's actually not that
difficult to implement. This actually solves a lot of cases. Assuming you
managed to identify the partition and block writes, then resolving things
once the network heals is simple: you just treat the other nodes as fresh
ones.
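The "minimum cluster size" idea can be sketched in a few lines; the names here (`writes_allowed`, `minimum_cluster_size`) are mine, purely for illustration, not actual elasticsearch settings:

```python
def writes_allowed(visible_nodes, minimum_cluster_size):
    """Return True when this side of a (possible) partition still sees
    at least the user-configured minimum number of nodes.

    If the check fails, the nodes on this side refuse writes, which
    keeps the minority side of a split from diverging.
    """
    return len(visible_nodes) >= minimum_cluster_size

# A 5-node cluster configured with a minimum size of 3 splits 3/2:
majority_side = ["n1", "n2", "n3"]
minority_side = ["n4", "n5"]
print(writes_allowed(majority_side, 3))  # True: keeps accepting writes
print(writes_allowed(minority_side, 3))  # False: blocks writes
```

With the minority side blocked, no conflicting writes accumulate, so once the partition heals its nodes can simply be treated as fresh ones.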
As a side note, elasticsearch is architected in a way that implementing
fixed-location masters should be possible, and then you can easily
implement a "Bigtable"/"Hadoop"-like solution. This will also be available
as an option in the GA elasticsearch version.
Yet another option is to always allow writes, and reconcile changes when the
network partition is resolved. This can be solved in elasticsearch's case by
adding a version/timestamp to each doc indexed, with the reconciliation
occurring when the network partition is resolved (rather than doing read
repair). Two copies of a shard, one from each side of the partition, will get
reconciled based on the changes done to them (either through versioning /
vector clocks / timestamps). The typical problem with this is handling
deletes, usually done by adding delete markers (tombstones), but there is no
foolproof solution for this, since you will need to delete those tombstones
at some point (it's explained, in Cassandra's case, here:
This is also something that I do want to support in elasticsearch, but more
into the future.
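A last-writer-wins reconciliation with tombstones, along the lines described above, could look roughly like this (a sketch with made-up names; a real system might use vector clocks instead of plain version numbers):

```python
TOMBSTONE = object()  # marker kept in place of a deleted doc's value

def reconcile(shard_a, shard_b):
    """Merge two copies of a shard after a partition heals.

    Each copy maps doc_id -> (version, value). A deleted doc keeps a
    tombstone entry, so a newer delete can win over an older write.
    Last-writer-wins on version number.
    """
    merged = dict(shard_a)
    for doc_id, (version, value) in shard_b.items():
        if doc_id not in merged or version > merged[doc_id][0]:
            merged[doc_id] = (version, value)
    return merged

a = {"doc1": (1, "hello"), "doc2": (3, TOMBSTONE)}      # doc2 deleted at v3
b = {"doc1": (2, "hello world"), "doc2": (2, "stale")}  # stale write at v2
m = reconcile(a, b)
# doc1 takes the newer write from b; doc2's tombstone (v3) beats the v2 write
```

This also shows why tombstones can't simply be dropped: without doc2's tombstone entry, the stale v2 write would win and the delete would be lost, which is exactly the Cassandra problem mentioned above.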
Hope the above makes sense and explains things (I skipped some things,
otherwise this email would need to be turned into a book). Let me finish with
a personal rant. My problem with the CAP theorem is that it seems so simple,
three simple properties of which only two can be realized, that people make
the mistake of really over-simplifying distributed systems and missing a lot
of the "fine print" in those systems. My other problem is with the Dynamo
paper, which, again, really simplifies distributed systems and is, IMO, a
patch built on top of another patch. I would say that if Amazon had written
"Dynamo paper, the revenge of Mr. Vogels", it would have been much, much
longer ;).
For example, not many people are aware of Cassandra's delete handling, and
the fact that deletes might get lost. You can insert and then delete, and
that delete might get lost because of the way deletion is handled. Another
problem is that inserts might get lost as well..., with hinted handoff. Or
CouchDB, which loses all the "history" of changes once a compaction occurs,
so you can't really resolve changes properly once a compaction has run. And
the list goes on for other solutions.
Or Terracotta, where not using sync writes (which sync to disk) means that
you might lose data in the event of a failure (as far as I know). Or if you
don't have enough memory to get the Berkeley DB B-tree nodes loaded into it.
Or when you run a server with a hot backup, and what happens when they get
partitioned (what really happens, btw? you are more of a Terracotta expert
than I am..., and even more interesting is what happens with server arrays).
I am not saying that these are not good products, and I would even say that
you probably have to solve things in the manner they chose to solve them,
but those "fine print" details are things typical users won't know; they
assume those products are "magical", almost as much as the iPad.
On Thu, Jun 17, 2010 at 11:29 AM, Sergio Bossa firstname.lastname@example.org wrote:
On Wed, Jun 16, 2010 at 9:30 PM, Shay Banon wrote:
You can't avoid split brain, it will happen once you get network partitions.
Not sure about this statement.
While I'm far from being a distributed systems expert (you're probably way
more of one than me), I think the impossibility result pretty much depends on
the system assumptions.
Clearly, it's impossible to avoid split brain in fully asynchronous networks
with lost messages.
But, while you're forced to keep the assumption above for partition-tolerant
and available systems, ES doesn't aim at tolerating partitions (as you stated
in your previous mail), so you can actually avoid split brains by employing
quorum algorithms and tolerating only a given number of "failures" (because
consistency needs to be preserved), blocking the eventually partitioned nodes
until they get back into the cluster.
You could also choose to embrace partition tolerance over availability, and
adopt the solution above but avoid blocking partitioned nodes and just assume
a fail-stop for them; or, assign a static "owner" to a given index/shard, so
that if the owner fails/gets partitioned, all writes to its index/shard will
be prohibited to keep consistency.
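The static-owner variant could be sketched like this (names are illustrative, not any real ES or Terracotta API):

```python
def can_write(shard_id, node, owners, reachable_nodes):
    """With a static owner per shard, a write for a shard is accepted
    only on its owner node; if the owner failed or got partitioned away,
    writes to that shard are refused everywhere, trading availability
    for consistency.
    """
    owner = owners[shard_id]
    return node == owner and owner in reachable_nodes

owners = {"shard0": "n1", "shard1": "n2"}
# n2 got partitioned away from this side of the cluster:
reachable = {"n1", "n3"}
print(can_write("shard0", "n1", owners, reachable))  # True
print(can_write("shard1", "n2", owners, reachable))  # False: owner unreachable
print(can_write("shard1", "n3", owners, reachable))  # False: not the owner
```

No two sides can ever accept conflicting writes for the same shard, because only the single owner may accept them, which is exactly the consistency-over-availability trade-off described above.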
If, on the other hand, clients got
partitioned with (c) as well, then they will continue to work with (c),
while other clients will work with (a) and (b).
This kind of solution indeed describes an available and partition-tolerant
system, because both ends of your partition stay available and you're
actually sacrificing consistency.
Once the network partition is resolved, then some sort of data resolution
needs to occur, either by discarding the small cluster, or by doing merging
/ conflict resolution.
Which is very hard for indexed data. Moreover, if this needs to be done
manually, users may actually miss inconsistencies or mess up their data.
I honestly thought ES provided some kind of algorithm to avoid
inconsistencies; it doesn't seem to be the case. I'm not saying that's bad,
just that it's different from what I expected.
Thanks for your attention,