Elasticsearch and the CAP theorem

nicolaslong · January 3, 2014, 7:46pm

I've been looking into CAP recently and wanted to develop my understanding
of the various tradeoffs and failure modes of Elasticsearch as a
distributed system.

I came across this post from a while back in which Kimchy (Shay) suggests
that ES gives up on partition tolerance, i.e. it chooses Consistency and
Availability out of CAP:

http://elasticsearch-users.115913.n3.nabble.com/CAP-theorem-td891925.html#a894234

From my understanding it seems like Kimchy was confused here. As a
distributed system ES can't give up on the P - you can't will
network/communication failures out of existent!

Instead, it seems like ES mostly compromises on the A (availability) part
of CAP. For example, unless you are willing to suffer potential split-brain
scenarios, setting min master nodes to n/2 + 1 will mean the smaller group
under a network partition will become unavailable (it will not respond to
read/writes). If you do allow split-brain then clearly consistency is
compromised and the client service will need to have some kind of conflict
resolution mechanism.

There are, of course, lots more nuances here.

It would be great if there were a page on the ES site/guide which went into
these issues in more detail as it is (IMO) essential information in
understanding how ES works and in deciding whether it is appropriate for
your use case. Ideally this page would give a general overview of ES
architecture:

replication behaviour
how requests are routed (and which nodes can handle requests)
how index operations are handled
how get and search requests are handled
how ES deals with background tasks and resource contention

(Some of this information exists on the site, but it is scattered about and
in any case not very detailed from what I can find.)

By providing this information, it could then discuss:

how ES approaches consistency, cases where data inconsistency can arise
(for example, what happens if two clients simultaneously update a piece of
content?, etc.), and it's approach to conflict resolution
how ES approaches availability

I'm currently working on writing a blog post on these issues. If it ends up
sufficiently detailed (and turns out accurate enough!) I'd be happy for it
to be added to the docs.

But it would be incredibly useful for someone knowledgeable to either check
what I write, or produce something themselves and is it frankly surprising
I can't find this information anywhere myself (without trawling the web for
scattered pieces of information, and also going through the code/testing a
live system).

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/b88fc6ac-024c-4a66-a95f-b1fd86a686e4%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

jprante · January 3, 2014, 9:17pm

ES gives up on partition tolerance, it means, if enough nodes fail, cluster
state turns red and ES does not proceed to operate on that index.

ES is not giving up on availability. Every request will be responded,
either true (with result) or false (error). In a system being not
available, you would have to expect the property of having some requests
that can no longer be answered at all (they hang forever or the responder
is gone).

The principle design of distributed operations in ES is like this: write
all ops on an index into a WAL (the translog). Send the ops to the nodes
while even some nodes may work reliable, some not. Stopping a node does not
harm the cluster as long as the replica level is high enough. When a
stopped node rejoins, initiate a recovery, using the WAL. Let the "best"
WAL result of all consistent results of the replica win for recovering the
index state.

Jörg

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAKdsXoGMRsk_hbtQHOEfWGqmigyQ1SP3VYkx9QgBKLJUfdzzhA%40mail.gmail.com.
For more options, visit https://groups.google.com/groups/opt_out.

nicolaslong · January 3, 2014, 9:30pm

Hi Jörg,

thanks for the reply. Very interesting.

So ES is not fully P.

But it is also not fully A, because A means that all nodes (or more
strictly, all partitions) continue to be available even in the event of
network failure - i.e. availability on either side of a partition.

And it is also not fully C depending on configuration (for the reasons
discussed above).

None of this is a criticism - while CAP is often presented as 'pick two of
the three', most real-life distributed systems are more nuanced. For
example, they make a 'best effort' at different parts of the triangle in
different circumstances to meet the needs of their target domain.

But it would still be great to have these design decisions explicitly
recognised and discussed somewhere in the docs.

If there are any write-ups relating to this kind of discussion please send
them my way - I'd love to read them.

I'll continue on my own write-up too.

Thanks again,

Nic

On Friday, January 3, 2014 9:17:31 PM UTC, Jörg Prante wrote:

ES gives up on partition tolerance, it means, if enough nodes fail,
cluster state turns red and ES does not proceed to operate on that index.

ES is not giving up on availability. Every request will be responded,
either true (with result) or false (error). In a system being not
available, you would have to expect the property of having some requests
that can no longer be answered at all (they hang forever or the responder
is gone).

The principle design of distributed operations in ES is like this: write
all ops on an index into a WAL (the translog). Send the ops to the nodes
while even some nodes may work reliable, some not. Stopping a node does not
harm the cluster as long as the replica level is high enough. When a
stopped node rejoins, initiate a recovery, using the WAL. Let the "best"
WAL result of all consistent results of the replica win for recovering the
index state.

Jörg

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/81ce51ef-a0b9-4471-86b5-032c34f85919%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

nicolaslong · January 3, 2014, 9:42pm

I should probably clarify my last post. ES does of course meet availability
if we assume it fully gives up on partition tolerance.

But in practice it's more mixed because it does attempt to tolerate some
level of partition (for example, depending on the size of the cluster and
configuration, it may allow for some level of communication failure/node
disconnects) in which case availability does get compromised.

So attempting to force it into a two-of-the-three model is not necessarily
that informative. While most ES users might agree with Kimchy that
substantial network failures are rare, a single node or data centre being
temporarily partitioned is likely to be a real possibility for some users.

On Friday, January 3, 2014 9:30:04 PM UTC, nicol...@gmail.com wrote:

Hi Jörg,

thanks for the reply. Very interesting.

So ES is not fully P.

But it is also not fully A, because A means that all nodes (or more
strictly, all partitions) continue to be available even in the event of
network failure - i.e. availability on either side of a partition.

And it is also not fully C depending on configuration (for the reasons
discussed above).

None of this is a criticism - while CAP is often presented as 'pick two of
the three', most real-life distributed systems are more nuanced. For
example, they make a 'best effort' at different parts of the triangle in
different circumstances to meet the needs of their target domain.

But it would still be great to have these design decisions explicitly
recognised and discussed somewhere in the docs.

If there are any write-ups relating to this kind of discussion please send
them my way - I'd love to read them.

I'll continue on my own write-up too.

Thanks again,

Nic

On Friday, January 3, 2014 9:17:31 PM UTC, Jörg Prante wrote:

ES gives up on partition tolerance, it means, if enough nodes fail,
cluster state turns red and ES does not proceed to operate on that index.

ES is not giving up on availability. Every request will be responded,
either true (with result) or false (error). In a system being not
available, you would have to expect the property of having some requests
that can no longer be answered at all (they hang forever or the responder
is gone).

The principle design of distributed operations in ES is like this: write
all ops on an index into a WAL (the translog). Send the ops to the nodes
while even some nodes may work reliable, some not. Stopping a node does not
harm the cluster as long as the replica level is high enough. When a
stopped node rejoins, initiate a recovery, using the WAL. Let the "best"
WAL result of all consistent results of the replica win for recovering the
index state.

Jörg

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/774f87b1-0e7c-45f5-aa6a-8e1d0cac4bb5%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

jprante · January 3, 2014, 11:22pm

Don't worry, CAP has nuances, for example, it does neglect latency in
networks.

Brewer made some remarks about his CAP theorem:

More recent work is focused on CRDT

http://pagesperso-systeme.lip6.fr/Marc.Shapiro/papers/RR-6956.pdf

or CALM

http://db.cs.berkeley.edu/jmh/calm-cidr-short.pdf

which in my eyes give interesting instruments for discussing eventual
consistency in distributed systems.

There are not much texts about CALM and CRDT but I find this well written:

http://book.mixu.net/distsys/index.html

And this text I enjoyed lately, regarding the efforts to make Redis a
cluster and why the WAIT proposal must fail (not related to ES but it shows
how hard it is to build distributed systems):

http://aphyr.com/posts/309-knossos-redis-and-linearizability

There are many aspects in that text I would also like to see applied
somehow to ES algorithms, for example to the question if ES can always
recover an index successfully by translog-based conflict resolution.

Jörg

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAKdsXoHVa7XSLBmXsGKHprkeyJMr2H5dMGE7Vi%3DpH5AvVxmAyQ%40mail.gmail.com.
For more options, visit https://groups.google.com/groups/opt_out.

nicolaslong · January 4, 2014, 12:10pm

Hi Jörg, thanks for the reply. I saw a great talk about CRDTs at the London
Scala Exchange but haven't read either of the papers you sent - they look
worth a read.

Aphyr's work on distributed systems and failure cases is incredible.
Unfortunately he never did a blog on Elasticsearch because that would have
been fascinating to see!

On Friday, January 3, 2014 11:22:53 PM UTC, Jörg Prante wrote:

Don't worry, CAP has nuances, for example, it does neglect latency in
networks.

Brewer made some remarks about his CAP theorem:

CAP Twelve Years Later: How the "Rules" Have Changed

More recent work is focused on CRDT

http://pagesperso-systeme.lip6.fr/Marc.Shapiro/papers/RR-6956.pdf

or CALM

http://db.cs.berkeley.edu/jmh/calm-cidr-short.pdf

which in my eyes give interesting instruments for discussing eventual
consistency in distributed systems.

There are not much texts about CALM and CRDT but I find this well written:

Distributed systems for fun and profit

And this text I enjoyed lately, regarding the efforts to make Redis a
cluster and why the WAIT proposal must fail (not related to ES but it shows
how hard it is to build distributed systems):

Knossos: Redis and linearizability

There are many aspects in that text I would also like to see applied
somehow to ES algorithms, for example to the question if ES can always
recover an index successfully by translog-based conflict resolution.

Jörg

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/0b6f6f0b-ea7f-493a-bfe6-f82892a75a06%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Itamar_Syn_Hershko · January 5, 2014, 1:09am

Jeorg, I'm afraid you are very wrong here

Every distributed system design starts with the assumption that you don't
want to compromise on Partition Tolerance, and therefore have to decide
whether you want full Availability or full Consistency. No distributed
system really gives up on P.

The very fact that you can have enough replicas of your index which will
make the cluster never get to a red state (e.g. the number of nodes you
have) proves this. An index (an eventually, all indexes on your cluster)
can survive a network split. It can also be always available, hence ES is
AP.

Elasticsearch's compromise is on C - consistency - like most NoSQL
databases. It uses Eventual Consistency to answer queries, not just because
of NRT search, but also because you may be querying a replica (a slave
node) which hasn't been brought up to speed yet.

--

Itamar Syn-Hershko
http://code972.com | @synhershko https://twitter.com/synhershko
Freelance Developer & Consultant
Author of RavenDB in Action http://manning.com/synhershko/

On Fri, Jan 3, 2014 at 11:17 PM, joergprante@gmail.com <
joergprante@gmail.com> wrote:

ES gives up on partition tolerance, it means, if enough nodes fail,
cluster state turns red and ES does not proceed to operate on that index.

ES is not giving up on availability. Every request will be responded,
either true (with result) or false (error). In a system being not
available, you would have to expect the property of having some requests
that can no longer be answered at all (they hang forever or the responder
is gone).

The principle design of distributed operations in ES is like this: write
all ops on an index into a WAL (the translog). Send the ops to the nodes
while even some nodes may work reliable, some not. Stopping a node does not
harm the cluster as long as the replica level is high enough. When a
stopped node rejoins, initiate a recovery, using the WAL. Let the "best"
WAL result of all consistent results of the replica win for recovering the
index state.

Jörg

--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/CAKdsXoGMRsk_hbtQHOEfWGqmigyQ1SP3VYkx9QgBKLJUfdzzhA%40mail.gmail.com
.

For more options, visit https://groups.google.com/groups/opt_out.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAHTr4ZurPtnvpyc1jh3C4iPKYtUye%3DagZcR99zYyjOLSTw6ZgA%40mail.gmail.com.
For more options, visit https://groups.google.com/groups/opt_out.

jprante · January 5, 2014, 1:51pm

Consistency is not given up by ES. First, on doc level, you have "write
your own read" consistency implemented as versioning - a doc read followed
by a doc write is guaranteed to be consistent if both read and write
versions match (MVCC). Write-your-read consistency is only eventually
consistent because other clients can not be sure when to read the old and
when to read the new value, and this is different from for example ACID
(causal consistency). But it's just another model of consistency.

And second, on index level, after a node crash, ES tries an index recovery
that should end up always in a consistent index state. This is possible
because of the WAL (translog) at each shard.

If ES gave up on consistency, there would be no doc versioning and no index
recovery.

Replica are not interfering with consistency, they are for availability.
The higher the replica level , the higher the probability that an index is
still available although a number of nodes are faulty. Replica level 0 (no
replica) is reducing availability, if just one node fails, availability has
gone.

If primary shard and replica shards differ in their responses to queries,
that should be considered as a bug. Maybe a recovery did not work out right.

ES gives up on partition tolerance. Just a few observations: split brains
can happen and ES can happily proceed reading and writing to the index in
such a case, but the result is not predictable - the usual case is that two
masters are going to control two divergent indices, and that is
catastrophic. This is not a fault but a (nasty) feature, and must be
controlled by extra safeguarding, by setting the minimum master value in
the configuration - for example, if this config is set over the quorum,
more than half of the nodes must be started before a master is elected and
the cluster is formed, so the probability is extremely low for a split
brain, but the probability is not 0 unless minimum master is equal to the
number of all master eligible nodes, which in other words disables
partition tolerance completely.

Also, another observation of mine, the discovery is detecting node
failures, but not network failures. The master node "pings" all other nodes
(under the assumption ES is always running on an always available network)
each 5 secs. If a node does not answer, the cluster state changes and marks
this node as not available. With the current algorithm, it is not possible
for ES to decide if it was the network or the node that failed. And this
makes sense if ES is already giving up on partition tolerance, so it simply
does not matter if it is a node or a network fault.

Jörg

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAKdsXoE_Y2rbS5o0OaLoyz-xD1FEo%2BbDwyMsKVAFO_ohPyMAEQ%40mail.gmail.com.
For more options, visit https://groups.google.com/groups/opt_out.

jprante · January 6, 2014, 2:49pm

Maybe soon he will...

github.com/elastic/elasticsearch

minimum_master_nodes does not prevent split-brain if splits are intersecting

opened 08:15AM - 17 Dec 12 UTC

closed 02:58PM - 01 Sep 14 UTC

saj

>bug v2.0.0-beta1 v1.4.0.Beta1

G'day, I'm using ElasticSearch 0.19.11 with the unicast Zen discovery protocol.… With this setup, I can easily split a 3-node cluster into two 'hemispheres' (continuing with the brain metaphor) with one node acting as a participant in both hemispheres. I believe this to be a significant problem, because now `minimum_master_nodes` is incapable of preventing certain split-brain scenarios. Here's what my 3-node test cluster looked like before I broke it: ![](https://saj.beta.anchortrove.com/es-splitbrain-1.png) Here's what the cluster looked like after simulating a communications failure between nodes (2) and (3): ![](https://saj.beta.anchortrove.com/es-splitbrain-2.png) Here's what seems to have happened immediately after the split: 1. Node (2) and (3) lose contact with one another. (`zen-disco-node_failed` ... `reason failed to ping`) 2. Node (2), still master of the left hemisphere, notes the disappearance of node (3) and broadcasts an advisory message to all of its followers. Node (1) takes note of the advisory. 3. Node (3) has now lost contact with its old master and decides to hold an election. It declares itself winner of the election. On declaring itself, it assumes master role of the right hemisphere, then broadcasts an advisory message to all of its followers. Node (1) takes note of this advisory, too. At this point, I can't say I know what to expect to find on node (1). If I query both masters for a list of nodes, I see node (1) in both clusters. Let's look at `minimum_master_nodes` as it applies to this test cluster. Assume I had set `minimum_master_nodes` to 2. Had node (3) been completely isolated from nodes (1) and (2), I would not have run into this problem. The left hemisphere would have enough nodes to satisfy the constraint; the right hemisphere would not. This would continue to work for larger clusters (with an appropriately larger value for `minimum_master_nodes`). The problem with `minimum_master_nodes` is that it does not work when the split brains are intersecting, as in my example above. Even on a larger cluster of, say, 7 nodes with `minimum_master_nodes` set to 4, all that needs to happen is for the 'right' two nodes to lose contact with one another (a master election has to take place) for the cluster to split. Is there anything that can be done to detect the intersecting split on node (1)? Would #1057 help? Am I missing something obvious? :)

Jörg

On Sat, Jan 4, 2014 at 1:10 PM, nicolaslong@gmail.com wrote:

Aphyr's work on distributed systems and failure cases is incredible.
Unfortunately he never did a blog on Elasticsearch because that would have
been fascinating to see!

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAKdsXoHt6BLfUJCdQNvv%2B7Cg33UQkHKem4xd_cLO%3DovhT-byaQ%40mail.gmail.com.
For more options, visit https://groups.google.com/groups/opt_out.

jasontedor · April 18, 2017, 12:20pm

Please do not ping people directly. If there's a specific question that you have, I would recommend starting a new topic. There are several members of the community here that are interested in distributed systems that would be happy to engage with you (myself included)!

Topic		Replies	Views
CAP theorem Elasticsearch	11	6414	July 6, 2017
Which side of CAP theorem Elasticsearch satisfy? Elasticsearch	2	6290	May 20, 2019
Fault Tolerance Fallacy Elasticsearch	9	468	July 6, 2017
Architecture Elasticsearch	1	267	July 6, 2017
Shard routing and consistency in Elasticsearch Elasticsearch	3	1633	February 3, 2021

Elasticsearch and the CAP theorem

Related topics