NRT Search and Read/Write consistency


(Prashanth Menon) #1

Hi everyone,

Though the documentation I've read is great, I haven't quite found the
answer to my concern around NRT and quorum reads (similar to this:
http://elasticsearch-users.115913.n3.nabble.com/quorum-consistency-and-datacenter-awareness-td696454.html#a696491
and this:
http://groups.google.com/a/elasticsearch.com/group/users/browse_thread/thread/44f9bea27fe117fe/68aa72d73233db83?lnk=gst&q=distributed+search#68aa72d73233db83).

Currently, I'm doing some high volume processing that requires heavy
searching into our ES cluster with several indices each with 3-5
shards and at least 2 replicas. The general process is to query ES
and insert/index a document if not found or update an existing
document. According to what I understand, each query (get/search)
gets hashed to a shard id and is serviced by the replica group
(primary shard + replica set). The problem I foresee is as follows:

In the multi-threaded case: two threads attempt updates. Thread one
searches for term "d1" in index "i1", does not find it, and indexes a
new document with a random id. At the same time, thread two searches
for the same "d1" term in index "i1" and also does not find it (just a
timing issue). Thread two then also indexes a new document with a new
random id. So we have duplicate document data here (or overwrites, had
the IDs been the same). Correct me if I'm wrong, but versioning doesn't
work here because the IDs of the documents are not the same, so what
can I do?
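
A minimal sketch of the race described above, using an in-memory stand-in rather than a real cluster. `ToyIndex` and its `search`/`index` methods are hypothetical and are not the Elasticsearch client API; the two "threads" are interleaved by hand so the outcome is deterministic.

```python
import uuid


class ToyIndex:
    """In-memory stand-in for an index; NOT the Elasticsearch client API."""

    def __init__(self):
        self.docs = {}  # doc_id -> source

    def search(self, term):
        # Return the ids of all documents holding the given term.
        return [doc_id for doc_id, doc in self.docs.items() if doc["term"] == term]

    def index(self, doc_id, source):
        self.docs[doc_id] = source


def interleaved_check_then_insert(index, term):
    # Both workers search before either one has written anything.
    found_by_one = index.search(term)
    found_by_two = index.search(term)

    # Each worker saw "not found", so each inserts under its own random id.
    if not found_by_one:
        index.index(uuid.uuid4().hex, {"term": term})
    if not found_by_two:
        index.index(uuid.uuid4().hex, {"term": term})

    return index.search(term)


duplicates = interleaved_check_then_insert(ToyIndex(), "d1")
print(len(duplicates))  # 2: the same term is now stored twice
```

Because the two ids differ, neither write conflicts with the other, which is why a version check alone cannot catch this case.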

In the single-threaded case: During processing, we query for "d1", do
not find it and index a new document (presumably into the primary
shard since it's a "dirty" operation?). Later on we perform a search
for the same field "d1" in the same index but it gets hashed/serviced
by a replica that hasn't received the update from the primary yet.
How can I get around NRT without forcing strong consistency? Is there
a quorum read feature I can use?


(Prashanth Menon) #2

So, it looks like the single-threaded case can be solved by using the
"preference" attribute when searching to point to the "primary" shard
only: https://github.com/elasticsearch/elasticsearch/issues/769. Any
word on the multi-threaded case?
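
A minimal sketch of what such a request could look like when built by hand. The host `localhost:9200`, the field name `term`, and the preference value `_primary` are assumptions for illustration (the value is what older releases accepted, as far as I know); check the linked issue and the documentation for your version before relying on it.

```python
from urllib.parse import urlencode, urlunsplit


def build_search_url(host, index, field, value, preference="_primary"):
    # "preference" asks the cluster which shard copies should serve the search.
    # The default value here is an assumption, not taken from the thread.
    query = urlencode({"q": f"{field}:{value}", "preference": preference})
    return urlunsplit(("http", host, f"/{index}/_search", query, ""))


# Hypothetical host and field name; "i1" and "d1" are the names used in this thread.
url = build_search_url("localhost:9200", "i1", "term", "d1")
print(url)
```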


(James Cook) #3

As you stated, I was going to point out the versioning support for
optimistic concurrency until you said the ids would not be the same. Is it
possible for your ids to be the same, and use another field (like 'key') for
this random id you are currently using?
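
A minimal sketch of that idea, assuming the document id can be derived from the searched term (here a SHA-1 hash) while the random id moves into a `key` field. `ToyIndex`, `VersionConflict`, `doc_id_for`, and `upsert` are hypothetical names for an in-memory stand-in, not the Elasticsearch client API; the point is only that writers for the same term now collide on one id, so a create-only write plus a version check can detect the race.

```python
import hashlib
import threading
import uuid


class VersionConflict(Exception):
    """Raised when a create or versioned update loses a race."""


class ToyIndex:
    """In-memory stand-in for an index; NOT the Elasticsearch client API."""

    def __init__(self):
        self._docs = {}  # doc_id -> (version, source)
        self._lock = threading.Lock()

    def get(self, doc_id):
        with self._lock:
            return self._docs.get(doc_id)

    def create(self, doc_id, source):
        # Create-only write: fails if the id already exists.
        with self._lock:
            if doc_id in self._docs:
                raise VersionConflict(doc_id)
            self._docs[doc_id] = (1, source)

    def update(self, doc_id, source, expected_version):
        # Versioned write: fails if someone else updated the document first.
        with self._lock:
            version, _ = self._docs[doc_id]
            if version != expected_version:
                raise VersionConflict(doc_id)
            self._docs[doc_id] = (version + 1, source)

    def count(self):
        with self._lock:
            return len(self._docs)


def doc_id_for(term):
    # Same term -> same id, so concurrent writers collide instead of duplicating.
    return hashlib.sha1(term.encode("utf-8")).hexdigest()


def upsert(index, term, max_retries=100):
    doc_id = doc_id_for(term)
    for _ in range(max_retries):
        current = index.get(doc_id)
        try:
            if current is None:
                # The random id survives as an ordinary field.
                index.create(doc_id, {"term": term, "key": uuid.uuid4().hex, "updates": 0})
            else:
                version, source = current
                changed = dict(source, updates=source["updates"] + 1)
                index.update(doc_id, changed, expected_version=version)
            return
        except VersionConflict:
            continue  # lost the race: re-read and try again
    raise RuntimeError("too many conflicts")


index = ToyIndex()
workers = [threading.Thread(target=upsert, args=(index, "d1")) for _ in range(8)]
for w in workers:
    w.start()
for w in workers:
    w.join()

version, source = index.get(doc_id_for("d1"))
print(index.count(), version, source["updates"])
```

With eight writers the toy index ends up holding one document for "d1": one create followed by seven versioned updates.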

-- jim


(Shay Banon) #4

How would you solve something like that with a relational database? I think
you will see you have exactly the same problem (not talking about the single
thread updater case), unless I am missing something.


(Prashanth Menon) #5

Hi Shay, James,

Haha, this is incredible. I asked myself that exact same question
moments after I made my post (can't find an edit button though). I
actually simplified my scenario but have an idea of how to solve it.
What I'm much happier about is that we had the same train of
thought.

Great work so far Shay, what you've got here is a pretty awesome
product.

- Prashanth

