Elasticsearch and the CAP theorem

ES does not give up consistency. First, on the document level, you have
"read your own writes" consistency, implemented as versioning: a doc read
followed by a doc write is guaranteed to be consistent if the read and
write versions match (MVCC). From the point of view of other clients this
is only eventual consistency, because they cannot be sure when they will
read the old value and when the new one, and this is different from, for
example, ACID (causal consistency). But it is just another consistency model.
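That versioning model can be sketched with a toy in-memory store (this is
purely illustrative Python, not the ES API): a write that names the version
it read succeeds only if nobody wrote in between.

```python
class VersionConflict(Exception):
    pass

class VersionedStore:
    """Toy MVCC store: a write may name the version it read; a stale
    version is rejected instead of silently overwriting."""
    def __init__(self):
        self.docs = {}  # doc_id -> (version, body)

    def get(self, doc_id):
        return self.docs[doc_id]  # (version, body)

    def index(self, doc_id, body, expected_version=None):
        current = self.docs.get(doc_id)
        cur_ver = current[0] if current else 0
        if expected_version is not None and cur_ver != expected_version:
            # Someone else wrote in between: refuse the write.
            raise VersionConflict(f"expected {expected_version}, found {cur_ver}")
        self.docs[doc_id] = (cur_ver + 1, body)
        return cur_ver + 1

store = VersionedStore()
store.index("1", {"n": 1})                        # version 1
ver, body = store.get("1")
store.index("1", {"n": 2}, expected_version=ver)  # ok: read your own write
try:
    store.index("1", {"n": 3}, expected_version=1)  # stale read version
except VersionConflict:
    print("conflict")  # prints "conflict"
```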

And second, on the index level, after a node crash ES attempts an index
recovery that should always end in a consistent index state. This is
possible because of the write-ahead log (the translog) kept by each shard.
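A toy sketch of the write-ahead-log idea (not the actual translog
implementation): every operation is appended to the log before it is
applied, so after a crash the surviving log can be replayed into a
consistent state.

```python
import io
import json

class Shard:
    """Toy shard: in-memory state plus an append-only write-ahead log."""
    def __init__(self, translog):
        self.translog = translog  # file-like, append-only
        self.docs = {}

    def index(self, doc_id, body):
        # Log first, then apply: the log is the source of truth.
        self.translog.write(json.dumps({"op": "index", "id": doc_id, "body": body}) + "\n")
        self.docs[doc_id] = body

    @classmethod
    def recover(cls, translog_contents):
        """After a crash, replay the log to rebuild a consistent state."""
        shard = cls(io.StringIO())
        for line in translog_contents.splitlines():
            entry = json.loads(line)
            if entry["op"] == "index":
                shard.docs[entry["id"]] = entry["body"]
        return shard

log = io.StringIO()
s = Shard(log)
s.index("1", {"msg": "hello"})
s.index("2", {"msg": "world"})
# Simulate a crash: only the log survives.
recovered = Shard.recover(log.getvalue())
print(recovered.docs == s.docs)  # prints True
```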

If ES gave up on consistency, there would be no doc versioning and no index
recovery.

Replicas do not interfere with consistency; they are for availability.
The higher the replica level, the higher the probability that an index is
still available even though a number of nodes are faulty. Replica level 0
(no replicas) reduces availability: if just one node fails, availability
is gone.
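To make that concrete, here is a back-of-the-envelope calculation. Under
the simplifying assumptions that the replicas + 1 copies of a shard live on
distinct nodes and that nodes fail independently with probability p, the
shard is unavailable only if every copy fails:

```python
def shard_available(replicas: int, p_node_failure: float) -> float:
    """Probability that at least one of the (replicas + 1) copies survives,
    assuming copies sit on distinct, independently failing nodes."""
    copies = replicas + 1
    return 1 - p_node_failure ** copies

# With a 10% node failure probability:
print(shard_available(0, 0.1))  # 0.9 -> a single failure kills the shard
print(shard_available(1, 0.1))  # ~0.99
print(shard_available(2, 0.1))  # ~0.999
```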

If the primary shard and replica shards differ in their responses to
queries, that should be considered a bug. Maybe a recovery did not work
out right.

ES gives up on partition tolerance. Just a few observations: split brains
can happen, and ES will happily continue reading and writing the index in
such a case, but the result is not predictable. The usual outcome is that
two masters end up controlling two divergent indices, and that is
catastrophic. This is not a fault but a (nasty) feature, and it must be
controlled by extra safeguarding: setting the minimum master value in the
configuration. For example, if this setting is set to the quorum, more
than half of the master-eligible nodes must be started before a master is
elected and the cluster is formed, so the probability of a split brain is
extremely low. But the probability is not 0 unless the minimum master
value equals the number of all master-eligible nodes, which in other words
disables partition tolerance completely.
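The quorum mentioned above is conventionally computed as "more than half",
i.e. floor(n/2) + 1 of the master-eligible nodes:

```python
def quorum(master_eligible_nodes: int) -> int:
    """Smallest majority of the master-eligible nodes: floor(n/2) + 1."""
    return master_eligible_nodes // 2 + 1

for n in (3, 4, 5):
    print(n, "->", quorum(n))  # 3 -> 2, 4 -> 3, 5 -> 3
# Setting the minimum master value to the full node count instead means
# the cluster cannot form after losing even one node.
```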

Also, another observation of mine: the discovery module detects node
failures, but not network failures. The master node "pings" all other
nodes every 5 seconds (under the assumption that ES always runs on an
always-available network). If a node does not answer, the cluster state
changes and marks that node as unavailable. With the current algorithm, it
is not possible for ES to decide whether it was the network or the node
that failed. And this makes sense if ES is already giving up partition
tolerance: it simply does not matter whether it is a node or a network
fault.
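A toy fault detector makes the point visible (this is an illustrative
sketch, not the ES discovery code): the detector only sees "answered" or
"did not answer", so a crashed node and a partitioned-off node produce the
same cluster state.

```python
def detect_failures(nodes, ping):
    """Toy ping-based fault detector: mark every node that fails to answer.
    'ping' stands in for the 5-second ping round; the detector cannot tell
    whether the node crashed or the network dropped the ping."""
    return {n: ("available" if ping(n) else "unavailable") for n in nodes}

crashed = {"node2"}       # process died
partitioned = {"node3"}   # alive, but unreachable over the network
state = detect_failures(["node1", "node2", "node3"],
                        lambda n: n not in crashed | partitioned)
# node2 and node3 are indistinguishable: both are just "unavailable".
print(state["node2"] == state["node3"])  # prints True
```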

Jörg

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAKdsXoE_Y2rbS5o0OaLoyz-xD1FEo%2BbDwyMsKVAFO_ohPyMAEQ%40mail.gmail.com.