Gateway, Replica, Bootstrap, Consistency and Large Sort Questions

I'm new to the project so apologies if answers to these questions were
covered elsewhere but I couldn't determine exact answers from the
documentation or other discussion topics.

Some general index gateway questions:

  1. The index gateway docs say that "The second level, in case of a
    complete cluster shutdown / failure, is the gateway support" which I
    read to mean, if no index gateway is configured and all the replicas
    for a partition are stopped, they will not be recoverable. Is this
    correct?
  2. If a primary shard fails and a new one is elected, what level of
    write consistency is guaranteed for the elected primary? See my
    questions below on consistency, but my concern here is data loss
    during failover due to write consistency as it relates to the
    replication for a partition.

My next questions are about replica bootstrap and consistency. For
these questions, assume I have a 7-node cluster with one index that
has three partitions and three replicas. If I understand correctly,
the partition/replica layout could look something like this (sorry
for the ASCII art):

host-0       | host-1       | host-2       | host-3       | host-4       | host-5       | host-6
Index0 P0 R0 | Index0 P1 R0 | Index0 P2 R0 | Index0 P0 R1 | Index0 P1 R1 | Index0 P2 R1 | Index0 P0 R2
Index0 P1 R2 | Index0 P2 R2 |              |              |              |              |

If this is correct, my questions are:

  1. If I change the configuration to have a 4th replica for each
    partition, how does the system elect where R3 for P0, P1 and P2 are
    replicated to? Is it based on load for example, such that host-0 and
    host-1 will not be considered as they already own two replicas?
  2. Subsequently, while the new 4th replica is coming online, how does
    it source its copy of the data? In other words, how does the new
    replica choose which existing replica to pull its copy of the
    replicated index from?
  3. If I have a write that would land in P2 from a client connected to
    host-0, at what point does the write return success to the client?
    After the write has been replicated to all three replicas for P2? One
    replica? A quorum of replicas?
  4. If my cluster experiences a partition such that host-0, host-1 and
    host-2 cannot see host-3, host-4, host-5 and host-6, I now effectively
    have a split quorum - one quorum for P0, and one for P1 and P2
    (assuming the partition replication I describe is possible). In this
    case, can read succeed for any of the partitions? Can writes? I'm
    coming from a Dynamo background where the answer is "it depends" but
I'm curious how Elasticsearch handles these types of partial quorum
    partitions.

My last questions deal with sorted result sets. Simply put, I'm
curious how the system handles total ordering across a partitioned
data set. Using the host/partition topology above, if I perform a
search query whose results span P0, P1 and P2, and I issue that query
to host-4, it can only locally sort ~.33 of my total results. Assuming
each host performs a local sort of the results, how is total ordering
achieved before results are sent to the client? Are the ordered
results bounded by the heap of one host's JVM?

Thanks and sorry for the barrage of questions.
-erik

Now that I see my previous post, the partition/replica diagram is
difficult to read. Here's the same data in a simpler form, where Pn
is the partition number and Rn is the replica number:

host-0:
P0:R0
P1:R2

host-1:
P1:R0
P2:R2

host-2:
P2:R0

host-3:
P0:R1

host-4:
P1:R1

host-5:
P2:R1

host-6:
P0:R2

On Monday, January 17, 2011 at 9:41 AM, eonnen wrote:

I'm new to the project so apologies if answers to these questions were
covered elsewhere but I couldn't determine exact answers from the
documentation or other discussion topics.

Some general index gateway questions:

  1. The index gateway docs say that "The second level, in case of a
    complete cluster shutdown / failure, is the gateway support" which I
    read to mean, if no index gateway is configured and all the replicas
    for a partition are stopped, they will not be recoverable. Is this
    correct?

The default gateway (local) will be able to restore the state of the cluster from the local data stored on each node. If no gateway is configured (you set it to none), then once all the shards and their replicas are down, the state is gone.

  2. If a primary shard fails and a new one is elected, what level of
    write consistency is guaranteed for the elected primary? See my
    questions below on consistency, but my concern here is data loss
    during failover due to write consistency as it relates to the
    replication for a partition.

The election of a replica shard to become the primary is fast once a node goes down. If there is no primary shard in the cluster, an index operation will simply wait until its timeout expires, in which case you will get an exception / failure on the index operation.
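
For illustration, here is a minimal sketch of an index request over the REST API (not from the original reply). It assumes a node on localhost:9200 and that your version supports the timeout request parameter on the index API; the index, type and field names are made up:

    # Minimal sketch: index a document, bounding how long the request will
    # wait for a primary shard to become available before failing.
    # Assumes an Elasticsearch node on localhost:9200; the "timeout"
    # parameter is an assumption about your version's index API.
    import requests

    doc = {"user": "erik", "message": "hello"}
    resp = requests.put(
        "http://localhost:9200/index0/doc/1",
        params={"timeout": "30s"},  # fail instead of waiting indefinitely for a primary
        json=doc,
    )
    print(resp.status_code, resp.json())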

My next questions are about replica bootstrap and consistency. For
these questions, assume I have a 7-node cluster with one index that
has three partitions and three replicas. If I understand correctly,
the partition/replica layout could look something like this (sorry
for the ASCII art):

host-0       | host-1       | host-2       | host-3       | host-4       | host-5       | host-6
Index0 P0 R0 | Index0 P1 R0 | Index0 P2 R0 | Index0 P0 R1 | Index0 P1 R1 | Index0 P2 R1 | Index0 P0 R2
Index0 P1 R2 | Index0 P2 R2 |              |              |              |              |

Not sure I got the art (from the second email as well) :). A simple way to test what happens is to run several local ES nodes (just fire it up several times), create an index, and call the cluster state API; it will list where each shard is allocated.
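
Something along these lines will print the allocation (a rough sketch, assuming a node on localhost:9200; the exact layout of the cluster state response may differ between versions):

    # Minimal sketch: ask the cluster state API where each shard copy lives.
    # Assumes an Elasticsearch node on localhost:9200; the response layout
    # (routing_table -> indices -> shards) may vary between versions.
    import requests

    state = requests.get("http://localhost:9200/_cluster/state").json()
    for index, table in state["routing_table"]["indices"].items():
        for shard_id, copies in table["shards"].items():
            for copy in copies:
                role = "primary" if copy["primary"] else "replica"
                print(index, "shard", shard_id, role, "on node", copy["node"])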

If this is correct, my questions are:

  1. If I change the configuration to have a 4th replica for each
    partition, how does the system elect where R3 for P0, P1 and P2 are
    replicated to? Is it based on load for example, such that host-0 and
    host-1 will not be considered as they already own two replicas?

The current load balancing scheme is simple: ES aims at an even spread of shards across all the nodes, and it won't allocate a shard and its replica on the same node. More advanced schemes are on the "planning board".
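
For the fourth-copy scenario above, a minimal sketch of bumping the replica count via the update settings API (assuming a node on localhost:9200 and an index named index0; ES then allocates the new copies according to the scheme just described):

    # Minimal sketch: raise the replica count so each shard gains one more copy.
    # Assumes an Elasticsearch node on localhost:9200 and an index named "index0".
    import requests

    resp = requests.put(
        "http://localhost:9200/index0/_settings",
        # 3 replicas plus the primary = 4 copies of each shard
        json={"index": {"number_of_replicas": 3}},
    )
    print(resp.json())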

  2. Subsequently, while the new 4th replica is coming online, how does
    it source its copy of the data? In other words, how does the new
    replica choose which existing replica to pull its copy of the
    replicated index from?

A new replica will always pull its data from the primary shard, maintaining consistency.
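
While the new replicas recover from their primaries, the cluster health API can be used to wait for the process to finish (a rough sketch, assuming a node on localhost:9200 and that your version supports the wait_for_status parameter):

    # Minimal sketch: block until the new replicas have finished recovering
    # and the cluster reports "green" again.
    # Assumes an Elasticsearch node on localhost:9200; "wait_for_status" is
    # an assumption about your version's cluster health API.
    import requests

    health = requests.get(
        "http://localhost:9200/_cluster/health",
        params={"wait_for_status": "green", "timeout": "60s"},
    ).json()
    print(health["status"],
          "initializing:", health.get("initializing_shards"),
          "relocating:", health.get("relocating_shards"))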

  3. If I have a write that would land in P2 from a client connected to
    host-0, at what point does the write return success to the client?
    After the write has been replicated to all three replicas for P2? One
    replica? A quorum of replicas?

By default, a write returns after it has been executed on the primary shard and replicated to all replicas.
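
For example (a sketch, not from the original reply): the call below only returns once the primary and all its replicas have applied the write; whether your version also exposes a replication=async option on the index API is an assumption here:

    # Minimal sketch: by default this index call returns only after the write
    # has been executed on the primary shard and on all of its replicas.
    # Assumes an Elasticsearch node on localhost:9200; the "replication"
    # parameter (sync/async) is an assumption about the available options.
    import requests

    resp = requests.put(
        "http://localhost:9200/index0/doc/42",
        params={"replication": "sync"},  # "async" would return after the primary alone
        json={"field": "value"},
    )
    print(resp.json())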

  4. If my cluster experiences a partition such that host-0, host-1 and
    host-2 cannot see host-3, host-4, host-5 and host-6, I now effectively
    have a split quorum - one quorum for P0, and one for P1 and P2
    (assuming the partition replication I describe is possible). In this
    case, can read succeed for any of the partitions? Can writes? I'm
    coming from a Dynamo background where the answer is "it depends" but
    I'm curious how Elasticsearch handles these types of partial quorum
    partitions.

You will end up with two fully functional, separate clusters, and they will not join each other. The current plan to improve this is twofold: first, the ability to define a quorum of nodes expected in the cluster which, if not met, will cause nodes to either become read-only or reject any request; and second, building on the current ability to dedicate specific nodes as master (non-data) nodes and others as data (non-master) nodes, the ability to also specify a quorum of masters.

ES is very different from Dynamo-based systems. More info here (old thread, but still mostly relevant): http://elasticsearch-users.115913.n3.nabble.com/CAP-theorem-td891925.html;cid=1294691830595-93.

My last questions deal with sorted result sets. Simply put, I'm
curious how the system handles total ordering across a partitioned
data set. Using the host/partition topology above, if I perform a
search query whose results span P0, P1 and P2, and I issue that query
to host-4, it can only locally sort ~.33 of my total results. Assuming
each host performs a local sort of the results, how is total ordering
achieved before results are sent to the client? Are the ordered
results bounded by the heap of one host's JVM?

The search is executed on each relevant shard, and the sorted values (along with the doc ids) are sent back to the node coordinating the distributed search; that's called the query phase in ES. Then, once the data is merged and sorted, a fetch phase retrieves the relevant documents from the relevant shards.
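
As an illustration (a sketch, with made-up index and field names, assuming a node on localhost:9200): in the query phase each shard returns only its own top from + size sorted entries, so the coordinating node merges a bounded set rather than holding every hit in memory:

    # Minimal sketch: a sorted, paginated search that spans all shards.
    # Each shard sorts locally and returns only its top (from + size) entries;
    # the coordinating node merges them and then fetches just the documents
    # for the requested page.
    # Assumes an Elasticsearch node on localhost:9200; "index0" and the
    # "timestamp" field are made-up names.
    import requests

    query = {
        "query": {"match_all": {}},
        "sort": [{"timestamp": {"order": "desc"}}],
        "from": 0,
        "size": 10,
    }
    resp = requests.post("http://localhost:9200/index0/_search", json=query).json()
    for hit in resp["hits"]["hits"]:
        print(hit["_id"], hit.get("sort"))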

Thanks and sorry for the barrage of questions.
-erik