Elastic Search and consistency


(badcase) #1

Hi,

We are trying to use ES as a system of record, so I am trying to understand
its consistency- and durability-related guarantees and knobs.

For this scenario, assume I will have 2 copies (1 replica). From what I
have read and gleaned (
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-index_.html#index-consistency)
=>

  • write.consistency is a check to see if the operation can be executed.
    It offers no guarantees in terms of data durability or operation success.
  • replication type => Is this for durability guarantees?
    • sync: all copies of the data are saved/indexed on nodes before the
      call returns.
    • async: the primary node's write is synchronous; the replicas are
      written asynchronously.
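For concreteness, this is how I understand the two knobs are passed on the 1.x index API; only the parameter names come from the docs linked above, the helper itself is just my own illustration:

```python
def index_request_params(consistency="quorum", replication="sync", timeout="1m"):
    """Build the query parameters for a 1.x index (PUT) call.

    consistency -> pre-flight check ("one", "quorum", "all")
    replication -> "sync" waits for replicas, "async" only for the primary
    """
    assert consistency in ("one", "quorum", "all")
    assert replication in ("sync", "async")
    return {"consistency": consistency, "replication": replication,
            "timeout": timeout}

# e.g. PUT /myindex/mytype/1?consistency=quorum&replication=sync&timeout=1m
params = index_request_params()
```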

Failure scenario question

Scenario is

=> PUT with write.consistency=quorum, but different replication =
sync/async.
=> What happens if there is a network glitch after the quorum check
succeeds, such that one of the non-primary shard machines is unreachable
or down?

  1. PUT with write.consistency=quorum and replication=sync =>
    1. Would the operation fail from the caller's perspective? It seems
      like it would, after some retries?
    2. I have read that if the primary succeeds but has problems with a
      replica, it will try to create a new shard. Is that a correct
      understanding? I assume that should be preventable, to handle network
      failures/partitions?
  2. PUT with write.consistency=quorum and replication=async =>
    1. The operation succeeds, since the primary shard is up.
    2. When the unreachable node comes back some time later, how does it
      get the copy of the data that it missed? Is there some process that
      will do that automatically?

If these have been asked/answered or documented already, please point me to
a source, and my apologies.

Thanks
Anand

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CANN9sNpFY39T2t2uy51Zh0m1p%3DBpqtqb7f9ho1iWTV%3DeUab23Q%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


(shikhar) #2

Hi Anand,

I would love to know as well, and asked about this in a thread today:
"quorum write + sync replication: guarantees".

This is pretty important information, given that Elasticsearch is being
touted as a reliable data store... which the presence of a "write
consistency level" setting would tempt one to believe, but there doesn't
seem to be evidence to support the guarantees that implies.

Cheers,

Shikhar

On Sat, Apr 5, 2014 at 12:06 AM, Anand Somani meatforums@gmail.com wrote:



(Jörg Prante) #3

I think the documentation is quite clear, but I will try to explain it in
my own words.

1.1 I am not sure what you mean by "after the quorum check". Write
consistency is a model where ES makes sure there are enough recipients
(nodes) before writes are executed. consistency=quorum fails if you have
too few nodes to match the actual replica count of the index.
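The arithmetic behind that pre-flight check can be sketched like this. This is the textbook majority formula and my own illustration; ES's actual rule reportedly special-cases very small replica counts, so check the source for the exact behavior:

```python
def quorum(total_copies: int) -> int:
    """Textbook majority quorum: floor(n / 2) + 1."""
    return total_copies // 2 + 1

def enough_copies(level: str, active: int, total: int) -> bool:
    """Pre-flight check in the spirit of write consistency: are there
    enough active shard copies for the write to be allowed to start?"""
    required = {"one": 1, "quorum": quorum(total), "all": total}[level]
    return active >= required
```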

1.2. No new shard is created - I am not sure I understand why such a thing
should happen.

2.1. No, the operation fails early if the replica count is too low.
replication=async means the operation is executed without waiting for the
nodes to respond; it does not mean the replica count is ignored.

2.2. When a node comes back, index recovery is performed to restore
consistency. The node holding a replica with intact checksums and the
youngest translog "wins" and sends its version of the shard to the other
node.
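A toy version of that selection rule (the field names are purely illustrative; the real recovery logic is more involved): among the copies whose checksums verify, the one with the youngest translog becomes the recovery source.

```python
def pick_recovery_source(copies):
    """copies: dicts with 'node', 'checksums_ok', 'translog_ts'
    (hypothetical fields). Return the node whose intact copy has the
    youngest (largest-timestamp) translog."""
    intact = [c for c in copies if c["checksums_ok"]]
    if not intact:
        raise RuntimeError("no intact copy to recover from")
    return max(intact, key=lambda c: c["translog_ts"])["node"]

copies = [
    {"node": "A", "checksums_ok": True,  "translog_ts": 100},
    {"node": "B", "checksums_ok": True,  "translog_ts": 250},
    {"node": "C", "checksums_ok": False, "translog_ts": 999},  # corrupt: skipped
]
# B wins: intact, and its translog is the youngest among intact copies
```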

Jörg

On Fri, Apr 4, 2014 at 8:36 PM, Anand Somani meatforums@gmail.com wrote:



(shikhar) #4

On Thu, Jun 12, 2014 at 8:42 PM, joergprante@gmail.com <
joergprante@gmail.com> wrote:

1.1 Not sure what you mean "after the quorum check". Write consistency is
a model where ES makes sure there are enough recipients (nodes) before
writes are executed. consistency=quorum fails if you have too few nodes to
match the actual replica count of the index.

This is the part that I am also after. ES currently does not seem to
provide any guarantee that an acknowledged write (from the caller's
perspective) succeeded on a quorum of replicas. This makes it unsuitable
as a primary data store, given that you can see data loss despite having
replicas!



(shikhar) #5

On Thu, Jun 12, 2014 at 8:52 PM, shikhar shikhar@schmizz.net wrote:

ES currently does not seem to provide any guarantee that an acknowledged
write (from the caller's perspective) succeeded on a quorum of replicas.

I take this back; I understand the ES model better now. Although the
write-consistency-level check is only applied before the write is about to
be issued, with sync replication the client can only get an ack if the
write succeeded on the primary shard as well as on all replicas (as per
the same cluster state the check was performed on). If it fails on some
replica(s), the operation is retried (together with the
write-consistency-level check, using a possibly updated cluster state).
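My reading of that flow, as a runnable sketch (the class and function names are mine, not from the ES source):

```python
class Node:
    """Stand-in for a shard copy's node."""
    def __init__(self, healthy=True):
        self.healthy = healthy
        self.docs = []

    def write(self, doc):
        """Ack iff the node is reachable/healthy."""
        if self.healthy:
            self.docs.append(doc)
        return self.healthy

def indexed(get_state, doc, max_retries=3):
    """Pre-flight consistency check, then write to primary + replicas;
    a replica failure retries the whole operation against a (possibly
    updated) cluster state."""
    for _ in range(max_retries):
        state = get_state()                    # may have changed between tries
        if len(state["active"]) < state["required"]:
            raise RuntimeError("consistency check failed")
        if all(node.write(doc) for node in state["active"]):
            return True                        # the client only sees an ack here
    raise RuntimeError("write did not succeed on all copies")
```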

This makes it unsuitable for a primary data store, given you can see data
loss despite having replicas!

If using ES as a primary store, you should really be running it with

  • index.gateway.local.sync: 0
    to make sure the translog fsyncs on every write operation

........

A follow-up question: what if there is a failure on one of the replicas
that prevents writes (e.g. disk full), but the node does not drop out of
the discovery state because it is otherwise healthy? Does that not make
the node a SPOF? This is something we have run into with SolrCloud
(https://issues.apache.org/jira/browse/SOLR-5805).



(Jörg Prante) #6

index.gateway.local.sync: 0 is related to durability: it means the
underlying data really goes to disk, using the guarantee of
FileChannel.force(false). This destroys performance compared to ES's
default value, because there are many more I/O operations at the OS layer
when fsync() is performed on each operation.
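The trade-off can be seen with plain file I/O, nothing ES-specific (a generic illustration): forcing a flush to stable storage per record is the "sync interval 0" behavior, and each call is an extra syscall the OS must wait on.

```python
import os
import tempfile

def append_record(path, record, fsync_each=True):
    """Append one record; fsync_each=True is the 'sync interval 0'
    analogue: data is forced to disk before the call returns (like
    FileChannel.force), at the cost of one fsync syscall per write."""
    with open(path, "ab") as f:
        f.write(record + b"\n")
        f.flush()
        if fsync_each:
            os.fsync(f.fileno())

path = tempfile.NamedTemporaryFile(delete=False).name
append_record(path, b"op-1")                    # durable before return
append_record(path, b"op-2", fsync_each=False)  # may sit in the OS cache
```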

Note that there is a small time gap between the API call, the transfer of
the call to the executing node(s), and the execution of the translog
write-ahead operation logging (no matter whether sync=0 or sync!=0). In
the early stage of request processing, a response may not yet have been
sent back to the client, because the request is still in flight. In rare
cases, all the nodes determined to receive the API call may crash, so that
no node is able to create a response and send it back to the client. It is
therefore up to the ES client to maintain a list of as-yet-unanswered API
calls and to wait for the success, failure, or timeout of every request
sent. API calls that are never answered should be handled accordingly on
the client side, to avoid data inconsistency. For example, unanswered
insert operations should be repeated (after the ES cluster comes up again).
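One way a client could implement that bookkeeping (entirely my own sketch, not an ES client API): keep every call until it is acknowledged, and replay whatever is left once the cluster is reachable again.

```python
class TrackingClient:
    """Remember every in-flight operation; anything never acked can be
    replayed after the cluster comes back."""

    def __init__(self, send):
        self.send = send      # callable: op -> True on ack, raises otherwise
        self.pending = {}     # op_id -> op, i.e. the yet-unanswered calls
        self._next_id = 0

    def submit(self, op):
        op_id, self._next_id = self._next_id, self._next_id + 1
        self.pending[op_id] = op
        try:
            if self.send(op):
                del self.pending[op_id]   # answered: safe to forget
        except Exception:
            pass                          # unanswered: keep for replay
        return op_id

    def replay_pending(self):
        for op_id, op in list(self.pending.items()):
            if self.send(op):
                del self.pending[op_id]
```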

This situation is rare: a sudden failure of the ES cluster's ability to
create responses, with many outstanding API requests. On well-balanced ES
clusters the "gap" should be minimal and barely noticeable. On heavily
loaded clusters, or clusters with slow I/O, latencies are high and the gap
may be more noticeable. ES's default settings help a lot to balance out
these situations by trying to prevent the API from being overrun.

Jörg

On Fri, Jun 13, 2014 at 8:41 AM, shikhar shikhar@schmizz.net wrote:



(shikhar) #7

On Fri, Jun 13, 2014 at 12:11 PM, shikhar shikhar@schmizz.net wrote:

I take this back; I understand the ES model better now. Although the
write-consistency-level check is only applied before the write is about
to be issued, with sync replication the client can only get an ack if the
write succeeded on the primary shard as well as on all replicas (as per
the same cluster state the check was performed on). If it fails on some
replica(s), the operation is retried (together with the
write-consistency-level check, using a possibly updated cluster state).

FWIW, I'm really not sure anymore. TransportShardReplicationOperationAction,
where this happens, has a bunch of logic in performReplicas(..)
https://github.com/elasticsearch/elasticsearch/blob/a06fd46a72193a387024b00e226241511a3851d0/src/main/java/org/elasticsearch/action/support/replication/TransportShardReplicationOperationAction.java#L557-L676
where it decides whether to take an updated cluster state into account,
and there seem to be exceptions for certain kinds of failures that are
tolerated.

It seems like this would be much more straightforward if a write were
fanned out and then blocked for up to the timeout while checking that the
required number of replicas succeeded (with success on the primary being
required).
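That fan-out-and-wait scheme can be sketched with a thread pool (my illustration of the proposal, not how ES actually implements it):

```python
from concurrent.futures import ThreadPoolExecutor, wait

def fan_out_write(primary, replicas, doc, required, timeout=1.0):
    """Send the write to every copy at once, then block up to `timeout`
    until the primary plus enough copies in total have acked.
    primary/replicas are callables: doc -> True on ack."""
    nodes = [primary] + replicas
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        futures = {pool.submit(n, doc): n for n in nodes}
        done, _ = wait(futures, timeout=timeout)
        acked = [futures[f] for f in done
                 if f.exception() is None and f.result()]
    if primary not in acked:
        raise RuntimeError("primary did not ack")
    if len(acked) < required:
        raise RuntimeError("required replica count not reached")
    return len(acked)
```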



(system) #8