Update/re-index a document

Extracted from
http://groups.google.com/group/elasticsearch/browse_thread/thread/cbd2cc71c407e435

How do I do updates (in RDBMS terminology) to a document? Do I need to find
the existing document (e.g. by id), delete the existing document and
insert a new document with the combined information from the old
document and the new information I have to add to it? Or is there any
other way of updating documents? What about transaction isolation when
doing this: if two processes are updating an existing document "at the
same time", will I be sure that one of them will fail and that the other
one will succeed?

When updating I need to be able to find the document that has to be
updated without involving all shards, or else I will not be able to
scale in number-of-possible-updates-per-time-unit. That is, I will not
be able to just buy more hardware to support more
updates-per-time-unit, the way I expect to be able to support more
inserts-per-time-unit by buying more hardware. When I want to update, I
know that only 0 or 1 documents will match the
search criteria I use to find the document to be updated, and that
the query will therefore return a result set of size 0 or 1. In order to
not involve all shards for such queries, there needs to be some kind of
configuration (the same as the one controlling the destination of a new
document among shards) that ES is able to take into consideration when
performing the search, so that it only asks the one shard where it knows the
document will exist if it exists. What kind of solutions do you have in
this area? Is this possible? Only on the ids of the documents? Or?

Answer already received in
http://groups.google.com/group/elasticsearch/browse_thread/thread/cbd2cc71c407e435:
To update a document, you read document, make changes to document, index
document. Optimistic concurrency is supported using versioning.

Comment to answer: Ok, as I understand your answer there IS such a
concept as "update" (in RDBMS terminology) in ES. I thought that indexing
a document would always be considered an "insert" (in RDBMS
terminology). As I understand you, the "index" operation in ES can be
used for both "inserting" and "updating". But that requires that ES is
able to see if a document you try to index is a "new" document or an
"updated" version of an existing document. How does ES know if it is one
or the other by looking at the document?

---- Status ----
Questions answered:

  • I guess routing can be used to make sure that not all shards will be
    contacted in order to find a specific document to be updated.

Questions still awaiting answers:

  • As I understand the answer above, the "index" operation can be used
    for both "insert" and "update" (in RDBMS terminology) of documents. How
    does ES know when a document sent for indexing is a new version of an
    already existing document and when it is actually a new document? Is it
    based on the value of the id of the document, or the fact that a
    version field exists, or ...?
  • As I understand the answer above, there is a version feature in ES
    enabling "optimistic locking" (if a document has changed between the
    time it was read and the time it is sent for re-indexing, the
    re-indexing operation will fail). Is that true?
  • Can you please provide me with a code example, first indexing a new
    document, then finding that document again for updating and re-indexing.
    Please include the "optimistic locking" feature (enabled explicitly, if
    that is needed), so that if the find/re-index is run concurrently in two
    threads so that they both get to find/read before any of them does the
    re-index, then one of them will succeed and one of them will fail. Thanks!

Regards, Per Steffensen

Questions answered at the end:

On Mon, Sep 12, 2011 at 11:38 AM, Per Steffensen steff@designware.dk wrote:

---- Status ----
Questions answered:

  • I guess routing can be used to make sure that not all shards will be
    contacted in order to find a specific document to be updated.

Questions still awaiting answers:

  • As I understand the answer above, the "index" operation can be used for
    both "insert" and "update" (in RDBMS terminology) of documents. How does ES
    know when a document sent for indexing is a new version of an already
    existing document and when it is actually a new document? Is it based on the
    value of the id of the document, or the fact that a version field exists,
    or ...?

It can check if the document already exists and what its version is when
indexing the doc against the index (in real time).

  • As I understand the answer above, there is a version feature in ES
    enabling "optimistic locking" (if a document has changed between the time it
    was read and the time it is sent for re-indexing, the re-indexing operation
    will fail). Is that true?

Yes. As long as you provide the version when indexing. A typical scenario
would be to "get" a document (hits a single shard), and index / update the
document while providing the version you have from the "get" operation.

  • Can you please provide me with a code example, first indexing a new
    document, then finding that document again for updating and re-indexing.
    Please include the "optimistic locking" feature (enabled explicitly, if that
    is needed), so that if the find/re-index is run concurrently in two threads
    so that they both get to find/read before any of them does the re-index, then
    one of them will succeed and one of them will fail. Thanks!

http://www.elasticsearch.org/blog/2011/02/08/versioning.html
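
The scenario asked for above (two threads both read the document before either re-indexes it) can be sketched as follows. This is a minimal in-memory Python model of the get / index-with-version flow described in the answer, not Elasticsearch client code: ToyShard, VersionConflict and run_two_writers are hypothetical names, and the version check only imitates the behaviour described in this thread.

```python
import threading


class VersionConflict(Exception):
    """Raised when the supplied version does not match the stored one."""


class ToyShard:
    """Holds a single document plus its version (hypothetical stand-in)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._version = 0
        self._source = None

    def get(self):
        """Return (version, source) as of now."""
        with self._lock:
            return self._version, self._source

    def index(self, source, version=None):
        """Store source; if version is given it must equal the current one."""
        with self._lock:
            if version is not None and version != self._version:
                raise VersionConflict(
                    f"expected version {version}, found {self._version}")
            self._version += 1
            self._source = source
            return self._version


def run_two_writers():
    shard = ToyShard()
    shard.index({"count": 0})  # initial "insert", stored as version 1
    both_have_read = threading.Barrier(2)
    outcomes = {}

    def writer(name):
        version, source = shard.get()
        both_have_read.wait()  # force both reads to happen before any write
        try:
            shard.index({"count": source["count"] + 1}, version=version)
            outcomes[name] = "ok"
        except VersionConflict:
            outcomes[name] = "conflict"

    threads = [threading.Thread(target=writer, args=(n,)) for n in ("a", "b")]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(outcomes.values()), shard.get()[0]
```

Because both writers hold version 1 when they write, the first write is accepted and moves the document to version 2, and the second is rejected. Which of the two threads wins is not determined. The losing writer would typically re-read the document and try again.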

Regards, Per Steffensen

Shay Banon wrote:

Questions answered at the end:

On Mon, Sep 12, 2011 at 11:38 AM, Per Steffensen <steff@designware.dk> wrote:

---- Status ----
Questions answered:
- I guess routing can be used to make sure that not all shards will
be contacted in order to find a specific document to be updated.

Questions still awaiting answers:
- As I understand the answer above, the "index" operation can be
used for both "insert" and "update" (in RDBMS terminology) of
documents. How does ES know when a document sent for indexing is a
new version of an already existing document and when it is
actually a new document? Is it based on the value of the id of the
document, or the fact that a version field exists, or ...?

It can check if the document already exists and what its version is
when indexing the doc against the index (in real time).

You are still not answering the question, probably because the answer
is so obvious to you that it is not worth answering :-) My question is
about how ES knows whether a document I send for indexing is a new
document or an updated version of an existing document.
Basically a document sent for indexing is just some JSON sent over HTTP;
there is nothing physical involved in either the get operation or the
index operation that tells ES that the document sent for indexing is
actually an updated version of the document just retrieved using the
get operation. I just want to know the set of fields that ES uses to
find out that a document sent to it is an updated version of an existing
document. I guess that the answer is that a document is a new version of
an existing document iff the type/_id corresponds to a document already
existing in the index.

- As I understand the answer above, there is a version feature in
ES enabling "optimistic locking" (if a document has changed
between the time it was read and the time it is sent for
re-indexing, the re-indexing operation will fail). Is that true?

Yes. As long as you provide the version when indexing. A typical
scenario would be to "get" a document (hits a single shard), and index
/ update the document while providing the version you have from the
"get" operation.

Again (as with one of my other questions) I guess this will only work if
I (as a programmer of apps operating against ES) make sure to use the
same routing value when I do the original indexing of a document and
when I do the update-indexing of the same document.

- Can you please provide me with a code example, first indexing a
new document, then finding that document again for updating and
re-indexing. Please including "optimistic locking" feature enabled
(if it needs to be so explicitly), so that if the find/re-index is
run concurrently in two threads so that they both get to find/read
before any of them does re-index, then one of them will succeed
and one of them will fail. Thanks!

http://www.elasticsearch.org/blog/2011/02/08/versioning.html

Thanks!

Regards, Per Steffensen

On Mon, Sep 12, 2011 at 1:10 PM, Per Steffensen steff@designware.dk wrote:

Shay Banon wrote:

Questions answered at the end:

On Mon, Sep 12, 2011 at 11:38 AM, Per Steffensen steff@designware.dk wrote:

---- Status ----
Questions answered:

  • I guess routing can be used to make sure that not all shards will be
    contacted in order to find a specific document to be updated.

Questions still awaiting answers:

  • As I understand the answer above, the "index" operation can be used for
    both "insert" and "update" (in RDBMS terminology) of documents. How does ES
    know when a document sent for indexing is a new version of an already
    existing document and when it is actually a new document? Is it based on the
    value of the id of the document, or the fact that a version field exists,
    or ...?

It can check if the document already exists and what its version is when
indexing the doc against the index (in real time).

You are still not answering the question, probably because the answer is
so obvious to you that it is not worth answering :-) My question is about
how ES knows whether a document I send for indexing is a new document or
an updated version of an existing document. Basically a document sent
for indexing is just some JSON sent over HTTP; there is nothing physical
involved in either the get operation or the index operation that tells ES
that the document sent for indexing is actually an updated version of the
document just retrieved using the get operation. I just want to know the set
of fields that ES uses to find out that a document sent to it is an updated
version of an existing document. I guess that the answer is that a document
is a new version of an existing document iff the type/_id corresponds to a
document already existing in the index.

I already answered that. The type/id ends up being the unique identifier of
a document within an index.
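
To illustrate the point that type/id identifies a document, here is a small Python sketch in which a plain dict stands in for an index. index_document is a hypothetical helper written for this illustration, not an Elasticsearch API; the only behaviour it models is that indexing under an existing (type, id) pair replaces the stored document, while a new pair creates one.

```python
def index_document(index, doc_type, doc_id, source):
    """Store source under (doc_type, doc_id); report whether it was new."""
    key = (doc_type, doc_id)
    existed = key in index
    index[key] = dict(source)  # the whole document is replaced, not merged
    return "updated" if existed else "created"


index = {}
first = index_document(index, "tweet", "1", {"text": "hello"})
second = index_document(index, "tweet", "1", {"text": "hello again"})
other = index_document(index, "user", "1", {"name": "per"})
```

The second call reuses type "tweet" and id "1", so it acts as an update. The third call uses the same id under a different type, so it creates a separate document. Nothing in the document body itself decides between the two cases.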

  • As I understand the answer above, there is a version feature in ES
    enabling "optimistic locking" (if a document has changed between the time it
    was read and the time it is sent for re-indexing, the re-indexing operation
    will fail). Is that true?

Yes. As long as you provide the version when indexing. A typical scenario
would be to "get" a document (hits a single shard), and index / update the
document while providing the version you have from the "get" operation.

Again (as with one of my other questions) I guess this will only work if I
(as a programmer of apps operating against ES) make sure to use the same
routing value when I do the original indexing of a document and when I do
the update-indexing of the same document.

You don't have to use a custom routing value. By default, the routing value
is the id of the document (which you have to provide when updating a
document). If you do provide a custom routing value, then you need to make
sure to provide the same one when you want to update the document.
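
The routing rule above can be sketched in Python as a function from routing value to shard number. This is only an illustration: shard_for is a hypothetical name, and the CRC32 hash is an assumption chosen because it is deterministic and in the standard library; it is not claimed to be the hash function Elasticsearch uses.

```python
import zlib


def shard_for(doc_id, num_shards, routing=None):
    """Pick a shard number from the routing value, defaulting to the id."""
    routing_value = doc_id if routing is None else routing
    return zlib.crc32(routing_value.encode("utf-8")) % num_shards


NUM_SHARDS = 5

# Default routing: the id alone decides the shard, on index and on update.
default_on_index = shard_for("doc-1", NUM_SHARDS)
default_on_update = shard_for("doc-1", NUM_SHARDS)

# Custom routing: the same routing value must be supplied both times.
custom_on_index = shard_for("doc-1", NUM_SHARDS, routing="customer-42")
custom_on_update = shard_for("doc-1", NUM_SHARDS, routing="customer-42")
```

If the update left out routing="customer-42", the shard would be computed from "doc-1" instead and could be a different one, in which case the update would not find the original document.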

  • Can you please provide me with a code example, first indexing a new
    document, then finding that document again for updating and re-indexing.
    Please include the "optimistic locking" feature (enabled explicitly, if that
    is needed), so that if the find/re-index is run concurrently in two threads
    so that they both get to find/read before any of them does the re-index, then
    one of them will succeed and one of them will fail. Thanks!

http://www.elasticsearch.org/blog/2011/02/08/versioning.html

Thanks!

Regards, Per Steffensen

I will provide the answers to my questions below, as I understand the
answers should be, in order to make it easy for future readers of this
thread to find their answers here.

Shay Banon wrote:

On Mon, Sep 12, 2011 at 1:10 PM, Per Steffensen <steff@designware.dk> wrote:

Shay Banon wrote:
Questions answered at the end:

On Mon, Sep 12, 2011 at 11:38 AM, Per Steffensen
<steff@designware.dk> wrote:

    Extracted from
    http://groups.google.com/group/elasticsearch/browse_thread/thread/cbd2cc71c407e435

    How do I do updates (in RDBMS terminology) to a document? Do I
    need to find the existing document (e.g. by id), delete the
    existing document and insert a new document with the
    combined information from the old document and the new
    information I have to add to it? Or is there any other way
    of updating documents?

Yes, updating is done by (re-)indexing the updated document with the
same type/_id as the original document - that will update the existing
document.

    What about transaction isolation when doing this: if two
    processes are updating an existing document "at the same
    time", will I be sure that one of them will fail and that the
    other one will succeed?

Concurrent updates of the same document will work correctly (one and
only one will succeed) due to the "optimistic locking" feature
(http://www.elasticsearch.org/blog/2011/02/08/versioning.html), provided
that two conditions hold. First, if custom routing is used, the updated
document must be routed to the same shard as the original document (use
the same _routing value). Second, the update (re-indexing) operation
must provide the correct version value.

    When updating I need to be able to find the document that has
    to be updated without involving all shards, or else I will
    not be able to scale in
    number-of-possible-updates-per-time-unit. That is, I will
    not be able to just buy more hardware to support
    more updates-per-time-unit, the way I expect to be able to
    support more inserts-per-time-unit by buying more hardware.
    When I want to update, I know that only 0 or 1 documents will
    match the search criteria I use to find
    the document to be updated, and that the query will therefore
    return a result set of size 0 or 1. In order to not involve
    all shards for such queries, there needs to be some kind of
    configuration (the same as the one controlling the
    destination of a new document among shards) that ES is able
    to take into consideration when performing the search, so that
    it only asks the one shard where it knows the document will exist
    if it exists. What kind of solutions do you have in this area? Is
    this possible? Only on the ids of the documents? Or?

Use routing to make sure that the "query" goes to the one shard where
you know that the document exists. If you are not using custom routing, you
need to make sure that you "query" on type and _id (the default routing
parameters).
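
As a rough illustration of why routing matters for scaling, the Python sketch below counts how many shards each kind of request has to contact. ToyCluster and its methods are hypothetical names, and the CRC32-based shard choice is an assumption made for the sketch, not Elasticsearch's actual routing function.

```python
import zlib


class ToyCluster:
    """A list of dicts standing in for the shards of one index."""

    def __init__(self, num_shards):
        self.shards = [dict() for _ in range(num_shards)]

    def _shard_no(self, routing_value):
        return zlib.crc32(routing_value.encode("utf-8")) % len(self.shards)

    def index(self, doc_id, source, routing=None):
        key = doc_id if routing is None else routing
        self.shards[self._shard_no(key)][doc_id] = source

    def get(self, doc_id, routing=None):
        """Return (source or None, number of shards contacted)."""
        key = doc_id if routing is None else routing
        return self.shards[self._shard_no(key)].get(doc_id), 1

    def search(self, predicate, routing=None):
        """Return (matching sources, number of shards contacted)."""
        if routing is None:
            targets = self.shards  # no routing hint: ask every shard
        else:
            targets = [self.shards[self._shard_no(routing)]]
        hits = [src for shard in targets
                for src in shard.values() if predicate(src)]
        return hits, len(targets)


cluster = ToyCluster(num_shards=5)
cluster.index("1", {"user": "per"}, routing="per")

by_get = cluster.get("1", routing="per")
unrouted = cluster.search(lambda doc: doc["user"] == "per")
routed = cluster.search(lambda doc: doc["user"] == "per", routing="per")
```

In this model the get and the routed search each touch one shard however many shards exist, while the unrouted search touches all five, so only the first two keep the per-update cost constant as shards are added.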

    Answer already received in
    http://groups.google.com/group/elasticsearch/browse_thread/thread/cbd2cc71c407e435:
    To update a document, you read document, make changes to
    document, index document. Optimistic concurrency is supported
    using versioning.

    Comment to answer: Ok, as I understand your answer there IS
    such a concept as "update" (in RDBMS terminology) in ES. I
    thought that indexing a document would always be considered
    an "insert" (in RDBMS terminology). As I understand you, the
    "index" operation in ES can be used for both "inserting" and
    "updating". But that requires that ES is able to see if a
    document you try to index is a "new" document or an "updated"
    version of an existing document. How does ES know if it is
    one or the other by looking at the document?

If type and _id match an existing document in the index, the
"index" operation will be considered an "update" operation on that
document.

    ---- Status ----
    Questions answered:
    - I guess routing can be used to make sure that not all shards
    will be contacted in order to find a specific document to be
    updated.

    Questions still awaiting answers:
    - As I understand the answer above, the "index" operation can
    be used for both "insert" and "update" (in RDBMS terminology)
    of documents. How does ES know when a document sent for
    indexing is a new version of an already existing document and
    when it is actually a new document? Is it based on the value
    of the id of the document, or the fact that a version field
    exists, or ...?


It can check if the document already exists and what its version
is when indexing the doc against the index (in real time).

You are still not answering the question, probably because the
answer is so obvious to you that it is not worth answering :-) My
question is about how ES knows whether a document I send for
indexing is a new document or an updated version of an
existing document. Basically a document sent for indexing is just
some JSON sent over HTTP; there is nothing physical involved in
either the get operation or the index operation that tells ES that
the document sent for indexing is actually an updated version of
the document just retrieved using the get operation. I just want
to know the set of fields that ES uses to find out that a document
sent to it is an updated version of an existing document. I guess
that the answer is that a document is a new version of an existing
document iff the type/_id corresponds to a document already
existing in the index.

I already answered that. The type/id ends up being the unique
identifier of a document within an index.

You didn't already answer that. But now you did. Thanks. I will
understand your answer this way: a document sent for indexing is
considered "the same as an existing document" iff it has the same values
for type and _id. Therefore an indexing operation will be considered an
"update" iff there already exists a document in the index with the same
values for type and _id as the document being indexed. This is not as
obvious as you might think.

    - As I understand the answer above, there is a
    version feature in ES enabling "optimistic locking" (if a
    document has changed between the time it was read and the
    time it is sent for re-indexing, the re-indexing operation
    will fail). Is that true?

Yes. As long as you provide the version when indexing. A typical
scenario would be to "get" a document (hits a single shard), and
index / update the document while providing the version you have
from the "get" operation.

Again (as with one of my other questions) I guess this will only
work if I (as a programmer of apps operating against ES) make sure
to use the same routing value when I do the original indexing of a
document and when I do the update-indexing of the same document.

You don't have to use a custom routing value.

I know.

By default, the routing value is the id of the document (which you
have to provide when updating a document). If you do provide a custom
routing value, then you need to make sure to provide the same one when
you want to update the document.

As stated somewhere else, I would state this clearly in the documentation
about routing.

    - Can you please provide me with a code example, first
    indexing a new document, then finding that document again for
    updating and re-indexing. Please include the "optimistic
    locking" feature (enabled explicitly, if that is needed),
    so that if the find/re-index is run concurrently in two
    threads so that they both get to find/read before any of them
    does the re-index, then one of them will succeed and one of them
    will fail. Thanks!


http://www.elasticsearch.org/blog/2011/02/08/versioning.html.
Thanks!
    Regards, Per Steffensen