CAP theorem

Hi,

According to the CAP theorem (http://en.wikipedia.org/wiki/CAP_theorem
and http://books.couchdb.org/relax/intro/eventual-consistency),
ElasticSearch can satisfy at most two of the following:

  • Consistency
  • Availability
  • Partition Tolerance

My guess is that Availability and Partition Tolerance are the ones
supported by ElasticSearch, and consistency is eventual, meaning that
two clients may sometimes get different results when executing the
same query.
Am I right?

Thanks,
Tal

On Sun, Jun 13, 2010 at 6:51 AM, Tal talsalmona@gmail.com wrote:

My guess is that Availability and Partition Tolerance are the ones
supported by Elasticsearch, and consistency is eventual, meaning that
two clients may sometimes get different results when executing the
same query.
Am I right?

Nope, AFAIK (please Shay correct me if I'm wrong) Elasticsearch provides
per-document consistency, meaning that writes will be atomically
executed on the "document owner" shard and synchronously replicated to
its replica shards.
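
To make that concrete, here's a rough sketch of that write path (plain
Python, purely illustrative; the class and function names are made up,
this is not elasticsearch code):

```python
# Illustrative sketch only: per-document consistency means a write goes to
# the document's "owner" (primary) shard and is synchronously copied to its
# replicas before the client gets an acknowledgement.

class Shard:
    def __init__(self, replicas=()):
        self.docs = {}
        self.replicas = list(replicas)

    def write(self, doc_id, doc):
        self.docs[doc_id] = doc

def route(doc_id, num_primaries):
    # documents are assigned an owner shard by hashing their id
    return hash(doc_id) % num_primaries

def index_document(primaries, doc_id, doc):
    primary = primaries[route(doc_id, len(primaries))]
    primary.write(doc_id, doc)         # atomic write on the owner shard
    for replica in primary.replicas:   # synchronous replication to replicas
        replica.write(doc_id, doc)
    return "ok"                        # ack only after primary + replicas wrote

primaries = [Shard(replicas=[Shard()]) for _ in range(2)]
print(index_document(primaries, "doc-1", {"title": "hello"}))
```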
Regarding A vs P, I honestly don't know how Elasticsearch behaves in
case of network partitions: Shay will surely be able to shed some
light (the discussion is indeed interesting).

Cheers,

Sergio B.

--
Sergio Bossa
http://www.linkedin.com/in/sergiob

Hi,

When it comes to CAP, at a very high level, elasticsearch gives up on
partition tolerance. This is for several reasons:

  1. I personally believe that *within the same data center*, network
    partitions very rarely happen, and when they do, it's a small set of
    machines (many times a single machine) that gets "partitioned out of the
    network". When a single machine gets disconnected from the network, that's
    not going to affect elasticsearch. When it comes to crossing data centers,
    a solution that gives up on consistency can be built on top of
    elasticsearch, either by elasticsearch (in the future), or now, by using
    messaging in some form to replicate changes between two data centers (a
    rough sketch of this follows after the list).

  2. When it comes to search engines and an inverted index, it's very hard,
    to the point of impossible, to build a solution that tries to resolve
    consistency problems on the fly as most products do (the famous "read
    repair"). When you search, you search over a large number of docs, and
    you can't read-repair each one all the time. Key-value / column-based
    solutions have it easy here....

    It might seem confusing, as if consistency is the one thing
    elasticsearch chooses to give up on because of its near-real-time nature.
    But it's not. The near-real-time aspect is mainly due to the overhead of
    making changes visible for search, but the changes are there once they
    are performed. Note, there are nice advancements toward becoming fully
    real time. I talked to Michael Busch at Berlin Buzzwords, and he
    implemented a very nice real-time search solution. It actually works in a
    similar way to something I started working on, but now I can just wait
    for it :).
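
To illustrate the messaging idea from point 1, here is a rough sketch
(plain Python; the queue is a stand-in for whatever broker you would use,
and none of this is elasticsearch code):

```python
# Sketch of messaging-based cross data center replication: each change is
# applied in the local data center, then published to a queue that the
# remote data center consumes and replays. Consistency between the two
# data centers is given up; they converge as messages get delivered.
import queue

cross_dc_queue = queue.Queue()   # stand-in for a real message broker

def index_and_replicate(local_index, doc_id, doc):
    local_index[doc_id] = doc                    # apply in the local DC first
    cross_dc_queue.put(("index", doc_id, doc))   # ship the change asynchronously

def replay_remote(remote_index):
    # the remote DC applies changes whenever they arrive
    while not cross_dc_queue.empty():
        op, doc_id, doc = cross_dc_queue.get()
        if op == "index":
            remote_index[doc_id] = doc

dc1, dc2 = {}, {}
index_and_replicate(dc1, "doc-1", {"title": "hello"})
replay_remote(dc2)        # dc2 catches up once the messages are replayed
print(dc1 == dc2)         # True
```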

-shay.banon


On Mon, Jun 14, 2010 at 2:29 PM, Shay Banon
shay.banon@elasticsearch.com wrote:

  1. I personally believe that *within the same data center*, network
    partitions very rarely happen, and when they do, it's a small set of
    machines (many times a single machine) that gets "partitioned out of the
    network". When a single machine gets disconnected from the network,
    that's not going to affect elasticsearch.

More specifically, how does ES cope with partitioned nodes?
Say we have 3 nodes (a, b, c) with 'c' as the master, and a partition
happens as follows: (a, b) (c). How does ES behave? How does it avoid
split-brain problems?

--
Sergio Bossa
http://www.linkedin.com/in/sergiob

You can't avoid split brain; it will happen once you get network
partitioning. The question is how you resolve it, and also what the
effect of the split brain is. In your scenario, once the split happens,
either (a) or (b) will become master of its sub-cluster. If (c) got totally
disconnected, then that's great, since clients won't be able to see it
either, so no data will be lost. If, on the other hand, clients got
partitioned along with (c), then they will continue to work with (c),
while other clients will work with (a) and (b).

Once the network partition is resolved, some sort of data resolution
needs to occur, either by discarding the small cluster, or by doing version
/ conflict resolution.
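
A toy way to picture the scenario (illustrative Python only, not how
elasticsearch is implemented; it shows the "discard the small cluster"
resolution):

```python
# (a, b) and (c) split; each side keeps accepting writes from the clients
# it can see, and once the partition heals the smaller side is discarded.

def write(side, doc_id, doc):
    side["docs"][doc_id] = doc

side_ab = {"nodes": ["a", "b"], "docs": {}}   # (a, b) elect a new master
side_c  = {"nodes": ["c"], "docs": {}}        # (c) still thinks it is master

write(side_ab, "doc-1", "written via a/b")
write(side_c,  "doc-1", "written via c")      # conflicting write on (c)

# partition heals: the smaller cluster's changes are discarded wholesale
winner = max(side_ab, side_c, key=lambda side: len(side["nodes"]))
print(winner["docs"]["doc-1"])   # 'written via a/b'; the write on (c) is lost
```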

-shay.banon


On Wed, Jun 16, 2010 at 9:30 PM, Shay Banon
shay.banon@elasticsearch.com wrote:

You can't avoid split brain; it will happen once you get network
partitioning.

I'm not sure about this statement.
While I'm far from being a distributed systems expert (you're probably
way more of one than me), I think the impossibility result pretty much
depends on the system assumptions.
Clearly, it's impossible to avoid split brain in fully asynchronous
networks with lost messages.
But, while you're forced to keep the assumption above for partition
tolerant and available systems, ES doesn't aim at tolerating
partitions (as you stated in your previous mail), so you can actually
avoid split brain by employing quorum algorithms and tolerating only
a given number of "failures" (because consistency needs to be
preserved), blocking any partitioned nodes until they get
reconnected.
You could also choose to embrace partition tolerance over
availability: adopt the solution above but avoid blocking
partitioned nodes and just assume fail-stop for them, or assign a
static "owner" to a given index/shard, so that if the owner fails/gets
partitioned, all writes to its index/shard will be prohibited to keep
consistency.
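
For instance, a minimal sketch of the quorum idea (illustrative only):

```python
# A side of the partition may elect a master / accept writes only if it can
# see a majority of the configured cluster, so at most one side stays
# active and split brain is avoided, at the cost of availability on the
# minority side.

def has_quorum(visible_nodes, cluster_size):
    return len(visible_nodes) > cluster_size // 2

cluster_size = 3   # nodes a, b, c
print(has_quorum(["a", "b"], cluster_size))  # True  -> (a, b) may elect a master
print(has_quorum(["c"], cluster_size))       # False -> (c) blocks until healed
```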

If, on the other hand, clients got
partitioned along with (c), then they will continue to work with (c),
while other clients will work with (a) and (b).

This kind of solution indeed describes an available and
partition-tolerant system, because both ends of your partition stay
available and you're actually sacrificing consistency.

Once the network partition is resolved, some sort of data resolution
needs to occur, either by discarding the small cluster, or by doing
version / conflict resolution.

Which is very hard for indexed data. Moreover, if this needs to be
done manually, users may actually miss inconsistencies or mess
things up ...

I honestly thought ES provided some kind of algorithm to avoid
inconsistencies; that doesn't seem to be the case. I'm not saying it's
bad, just that it's different from what I expected.

Thanks for your attention,

Sergio B.

--
Sergio Bossa
http://www.linkedin.com/in/sergiob

Maan, a discussion like this should happen over a beer and not over emails;
it's very hard to convey all the different aspects, and trying to write
short answers just misses the point, but I will try to write something... .

What I meant when I said you can't get around split brain was actually that
you can't avoid network partitions; the question is what you do with them,
and whether split brain can be avoided or not. One of the things I don't
like (even in respectable projects) is the "assertion" that this product
does this or does that, especially since CAP relates to a point in time ....

The first thing to note is that a search engine is very different from
key-value or column-based storage systems. A typical search query hits a
huge amount of data, and you want to keep scoring correct for it without
needing to "repair" each search doc per search.

Another point is to differentiate between what elasticsearch does now and
what it is designed to do in its final version. It is certainly architected
in a way to solve most things, and to behave based on what the user decided
to give up on, but it's simply not all implemented yet... (I will point out
what is not implemented later).

So, one of the main problems when network partitioning happens is that you
can't know whether the node that got partitioned is down or just not
reachable, and the question is how you handle it. It gets even worse if the
node that was partitioned got partitioned together with clients connected
to it, which keep on working against it. It gets even more interesting in
systems that reallocate shards in the event of node failure (or a node
getting partitioned out), as elasticsearch does.

The main problem with the above is the fact that clients (that got
partitioned with the offending node(s)) might still be sending data to
them, in which case you have two options. The first is to not allow them to
make changes (search might still be ok), and the other is to allow them to
write and reconcile once the network partitioning is resolved.

Let's start with not allowing writes. For that, you need to identify when
writes will not be allowed. This is much simpler to do when you have
fixed-location master(s), since if you lose the connection to it (or to a
quorum of them), you basically go into this mode. In elasticsearch there is
no fixed-location master out of the box, so the other option is to do some
sort of quorum across all the nodes in the cluster. That's not simple,
mainly because an elasticsearch cluster might be reduced to a smaller size
intentionally. One simple option to solve this is to have the user define a
"minimum" size of cluster that should be up, and if it's not, the nodes
will not allow writes. This will be in elasticsearch, and it's actually not
that difficult to implement. It actually solves a lot of cases. Assuming
you managed to identify the partition and block writes, then resolving the
network partitioning once it heals is simple: you just treat the other
nodes as fresh ones.
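
As a sketch of that idea (illustrative Python; the threshold is a
hypothetical user setting, not an existing elasticsearch option):

```python
# Nodes refuse writes when they can see fewer nodes than the user-defined
# minimum cluster size; searches might still be allowed on the small side.

MINIMUM_CLUSTER_SIZE = 2   # hypothetical user-defined setting

def allow_write(visible_nodes):
    return len(visible_nodes) >= MINIMUM_CLUSTER_SIZE

print(allow_write(["a", "b"]))  # True: this side keeps accepting writes
print(allow_write(["c"]))       # False: this side blocks writes until healed
```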

As a side note, elasticsearch is architected in a way that implementing
fixed-location masters should be possible, and then you can easily
implement a "big table"/"hadoop"-like solution. This will also be available
as an option in the GA elasticsearch version.

Yet another option is to always allow writes, and reconcile changes when a
network partition is resolved. This can be solved in elasticsearch's case
by adding a version/timestamp to each doc indexed; the reconciliation will
occur when the network partition is resolved (rather than doing read
repair). Two identical shards, one from each side of the network partition,
will get reconciled based on the changes done to them (either through
versioning / vector clocks / timestamps). The typical problem with this is
handling deletes, usually by adding delete markers (tombstones), but there
is no foolproof solution for this since you will need to delete those
markers at some point (it's explained, in cassandra's case, here:
http://wiki.apache.org/cassandra/DistributedDeletes).
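
A rough sketch of that kind of reconciliation (illustrative Python only;
real versioning / vector clocks are more involved):

```python
# Each side keeps (version, value) per doc; a delete is just a versioned
# write of a tombstone marker. On heal, the higher version wins per doc.

TOMBSTONE = object()   # delete marker

def reconcile(shard_a, shard_b):
    merged = {}
    for doc_id in shard_a.keys() | shard_b.keys():
        version_a, doc_a = shard_a.get(doc_id, (0, None))
        version_b, doc_b = shard_b.get(doc_id, (0, None))
        merged[doc_id] = (version_a, doc_a) if version_a >= version_b \
                         else (version_b, doc_b)
    return merged

side1 = {"doc-1": (3, {"title": "newer"}), "doc-2": (2, TOMBSTONE)}
side2 = {"doc-1": (2, {"title": "older"}), "doc-2": (1, {"title": "old doc"})}

merged = reconcile(side1, side2)
print(merged["doc-1"])                  # (3, {'title': 'newer'})
print(merged["doc-2"][1] is TOMBSTONE)  # True: the delete survives the merge
# The unsolved part mentioned above: when is it safe to purge the tombstone?
```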

This is also something that I do want to support in elasticsearch, but
further into the future.

Hope the above makes sense and explains things (I skipped some things,
otherwise this email would need to be turned into a book :wink: ). Let me
finish with a personal rant. My problem with the CAP theorem is that it
seems so simple, 3 simple rules of which only two can be realized, that
people make the mistake of really oversimplifying distributed systems and
missing a lot of the "fine print" in those systems. My other problem is
with the dynamo paper, which again really simplifies distributed systems
and is, IMO, a patch built on top of another patch. I would say that if
Amazon were to write "dynamo paper, the revenge of Mr. Vogels", it would be
much, much longer ;).

For example, not many people are aware of Cassandra's delete handling, and
the fact that deletes might get lost. You can insert and then delete, and
that delete might get lost because of its deletion handling. Another
problem is that inserts might get lost as well..., with hinted handoff. Or
couchdb, which loses all the "history" of changes once a compaction occurs,
so you can't really resolve changes properly once a compaction has run. And
the list goes on for other solutions.

Or with terracotta, where not using sync writes (which sync to disk) means
that you might lose data in the event of a failure (as far as I know). Or
if you don't have enough memory to get the berkeley btree nodes loaded into
it. Or when you run a server with a hot backup, and what happens when they
get partitioned (what really happens, btw? you are more of a terracotta
expert than me..., even more interesting is what happens with server
arrays).

I am not saying that those are not good products, and I would even say that
you probably have to solve things in the manner they chose to solve them,
but that "fine print" is something typical users won't know; they assume
those products are "magical", almost as much as the ipad.

-shay.banon


On Thu, Jun 17, 2010 at 2:37 PM, Shay Banon
shay.banon@elasticsearch.com wrote:

Maan, a discussion like this should happen over a beer and not over emails;
it's very hard to convey all the different aspects, and trying to write
short answers just misses the point, but I will try to write something... .

I know, I know, but "a few" miles separate us, so no beers and chats;
the only thing remaining is email :wink:

You gave a very satisfying and informative answer, and I think it's
actually very important for ES end users: they may think ES is
magical (like the iPad, as you said), but it isn't.
It currently sacrifices consistency in case of partitions, in the terms
you explained, and you are planning to implement a bunch of cool
features to make ES suit different needs: those are all important bits
of information, and I think users will be grateful for them. At least,
I am :wink:

Thanks,
Cheers!

Sergio B.

--
Sergio Bossa
http://www.linkedin.com/in/sergiob

Well, next time I am in Rome (well, I've never been, so it will be my first
time there :wink: )... . Happy the answers make sense. Btw, you did not
answer the terracotta questions; that is something I always wanted to know
about but could not find anything on in the docs... .

-shay.banon


On Sat, Jun 19, 2010 at 11:55 PM, Shay Banon
shay.banon@elasticsearch.com wrote:

Well, next time I am in Rome (well, I've never been, so it will be my first
time there :wink: )... .

Anytime :wink:

Happy the answers make sense. Btw, you did not answer
the terracotta questions; that is something I always wanted to know
about but could not find anything on in the docs... .

Sorry, I missed your questions :slight_smile:
Anyway, Terracotta should work as follows:

  1. In case of client or server failure, using async writes, there's no
    data loss provided you run an active/passive pair: the passive one
    will take over the transaction and complete it as the new active.
  2. In case of active/passive server partitioning, the currently active
    one will keep its clients connected, while the passive one will
    elect itself as a master, but with no attached clients, so there will
    be no split brain; once the partition heals, the server which
    kept the attached clients will zap the other one and downshift it to
    the passive state (a toy model of this follows after the list).
    In the end, you could have a split brain only if you had one server
    and a bunch of clients on one switch, and another server and another
    bunch of clients on another switch, and the switches got partitioned
    ... a pretty bizarre network configuration, provided you're not
    running in the cloud ... so, Terracotta also has its own split-brain
    vulnerabilities, which are IMHO less common than the ES ones, but
    Terracotta is master-based so it's easy to manage coordination, while
    ES is decentralized and yadda-yadda-yadda ... you get the idea :slight_smile:
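
A toy model of that failover logic (illustrative Python only, certainly
not Terracotta code):

```python
# During a partition the passive self-elects but serves no clients; once
# the partition heals, the server that kept the clients "zaps" the other
# one and downshifts it back to passive.

class Server:
    def __init__(self, name, role, clients=0):
        self.name, self.role, self.clients = name, role, clients

def partition(active, passive):
    passive.role = "active"   # passive self-elects, but no clients follow it

def heal(s1, s2):
    winner = s1 if s1.clients >= s2.clients else s2  # the one with clients wins
    loser = s2 if winner is s1 else s1
    loser.role = "passive"                           # "zap" + downshift
    return winner

a = Server("srv-1", "active", clients=10)
b = Server("srv-2", "passive")
partition(a, b)
print(a.role, b.role)    # active active (but only srv-1 serves clients)
print(heal(a, b).name)   # srv-1; srv-2 is zapped back to passive
```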

Hope that answers your questions ... feel free to ask more, obviously :wink:
Cheers!

Sergio B.

--
Sergio Bossa
http://www.linkedin.com/in/sergiob

Interesting. First of all, in most cases you have both the es clients
("native clients") and the nodes within the same network switch, so the
chances of split brain with clients are identical to terracotta's.

So, now I understand: TC has sync replication between active and passive
servers, plus writing to disk (in an async manner, I presume).

Yet another question in the TC case, then. Let's assume you have two
servers, active and passive, and both, I assume, write their state to the
local disk. Now, you bring down the active server, the passive becomes
active (master), and clients start writing to the new active server.

Now, I bring down that last server, which is active, and afterwards start
the first server, which, I assume, becomes active and starts to receive
client requests. Now, that server has an old view of the data, since
clients have been making changes on the other server while it was down.
How does TC recover from that?

-shay.banon
