Currently, I'm doing some high volume processing that requires heavy
searching into our ES cluster with several indices each with 3-5
shards and at least 2 replicas. The general process is to query ES
and insert/index a document if not found or update an existing
document. According to what I understand, each query (get/search)
gets hashed to a shard id and is serviced by the replica group
(primary shard + replica set). The problem I foresee is as follows:
In a multi-threaded case: Two threads attempting updates. Thread one
searches term "d1" in index "i1", does not find it, performs index and
insert or a new document with a random id. At the same time, thread
two performs search for the same "d1" term in index "i1" but does not
find it (just a timing issue). Thread two then has to also perform an
index and insert of a new document with a new random id. So we have
duplicate document data here (or overwrites had the ID's been the
same). Correct me if I'm wrong but versioning doesn't work here
because the ID's of the documents are not the same, so what can I do?
In the single-threaded case: During processing, we query for "d1", do
not find it and index a new document (presumably into the primary
shard since it's "dirty" operation?). Later on we perform a search
for the same field "d1" in the same index but it gets hashed/serviced
by a replica that hasn't received the update from the primary yet.
How can I get around NRT without forcing strong consistency? Is there
a quorum read feature I can use?
Currently, I'm doing some high volume processing that requires heavy
searching into our ES cluster with several indices each with 3-5
shards and at least 2 replicas. The general process is to query ES
and insert/index a document if not found or update an existing
document. According to what I understand, each query (get/search)
gets hashed to a shard id and is serviced by the replica group
(primary shard + replica set). The problem I foresee is as follows:
In a multi-threaded case: Two threads attempting updates. Thread one
searches term "d1" in index "i1", does not find it, performs index and
insert or a new document with a random id. At the same time, thread
two performs search for the same "d1" term in index "i1" but does not
find it (just a timing issue). Thread two then has to also perform an
index and insert of a new document with a new random id. So we have
duplicate document data here (or overwrites had the ID's been the
same). Correct me if I'm wrong but versioning doesn't work here
because the ID's of the documents are not the same, so what can I do?
In the single-threaded case: During processing, we query for "d1", do
not find it and index a new document (presumably into the primary
shard since it's "dirty" operation?). Later on we perform a search
for the same field "d1" in the same index but it gets hashed/serviced
by a replica that hasn't received the update from the primary yet.
How can I get around NRT without forcing strong consistency? Is there
a quorum read feature I can use?
As you stated, I was going to point out the versioning support for
optimistic concurrency until you said the ids would not be the same. Is it
possible for your ids to be the same, and use another field (like 'key') for
this random id you are currently using?
How would you solve something like that with a relational database? I think
you will see you have exactly the same problem (not talking about the single
thread updater case), unless I am missing something.
Currently, I'm doing some high volume processing that requires heavy
searching into our ES cluster with several indices each with 3-5
shards and at least 2 replicas. The general process is to query ES
and insert/index a document if not found or update an existing
document. According to what I understand, each query (get/search)
gets hashed to a shard id and is serviced by the replica group
(primary shard + replica set). The problem I foresee is as follows:
In a multi-threaded case: Two threads attempting updates. Thread one
searches term "d1" in index "i1", does not find it, performs index and
insert or a new document with a random id. At the same time, thread
two performs search for the same "d1" term in index "i1" but does not
find it (just a timing issue). Thread two then has to also perform an
index and insert of a new document with a new random id. So we have
duplicate document data here (or overwrites had the ID's been the
same). Correct me if I'm wrong but versioning doesn't work here
because the ID's of the documents are not the same, so what can I do?
In the single-threaded case: During processing, we query for "d1", do
not find it and index a new document (presumably into the primary
shard since it's "dirty" operation?). Later on we perform a search
for the same field "d1" in the same index but it gets hashed/serviced
by a replica that hasn't received the update from the primary yet.
How can I get around NRT without forcing strong consistency? Is there
a quorum read feature I can use?
Haha, this is incredible. I asked myself that exact same question
moments after I made my post (can't find an edit button though). I
actually simplified my scenario but have an idea of how to solve it.
What I'm much happier about was that we had the same train of
thought.
Great work so far Shay, what you've got here is a pretty awesome
product.
How would you solve something like that with a relational database? I think
you will see you have exactly the same problem (not talking about the single
thread updater case), unless I am missing something.
On Sun, Aug 21, 2011 at 11:22 PM, Prashanth Menon <
Currently, I'm doing some high volume processing that requires heavy
searching into our ES cluster with several indices each with 3-5
shards and at least 2 replicas. The general process is to query ES
and insert/index a document if not found or update an existing
document. According to what I understand, each query (get/search)
gets hashed to a shard id and is serviced by the replica group
(primary shard + replica set). The problem I foresee is as follows:
In a multi-threaded case: Two threads attempting updates. Thread one
searches term "d1" in index "i1", does not find it, performs index and
insert or a new document with a random id. At the same time, thread
two performs search for the same "d1" term in index "i1" but does not
find it (just a timing issue). Thread two then has to also perform an
index and insert of a new document with a new random id. So we have
duplicate document data here (or overwrites had the ID's been the
same). Correct me if I'm wrong but versioning doesn't work here
because the ID's of the documents are not the same, so what can I do?
In the single-threaded case: During processing, we query for "d1", do
not find it and index a new document (presumably into the primary
shard since it's "dirty" operation?). Later on we perform a search
for the same field "d1" in the same index but it gets hashed/serviced
by a replica that hasn't received the update from the primary yet.
How can I get around NRT without forcing strong consistency? Is there
a quorum read feature I can use?
Apache, Apache Lucene, Apache Hadoop, Hadoop, HDFS and the yellow elephant
logo are trademarks of the
Apache Software Foundation
in the United States and/or other countries.