I will provide the answers to my questions below, as I understand them,
in order to make it easy for future readers of this thread to find
their answers here.
Shay Banon wrote:
On Mon, Sep 12, 2011 at 1:10 PM, Per Steffensen <firstname.lastname@example.org
Shay Banon wrote:
Questions answered at the end:
On Mon, Sep 12, 2011 at 11:38 AM, Per Steffensen
<email@example.com> wrote:
How do I do updates (in RDBMS terminology) to a document? Do I
need to find the existing document (e.g. by id), delete the
existing document and insert a new document with the
combined information from the old document and the new
information I have to add to it? Or is there any other way
Yes, updating is done by (re-)indexing the updated document with the
same type/_id as the original document - that will update the existing
What about transaction isolation when doing this - if two
processes are updating an existing document "at the same
time" will I be sure that one of them will fail and that the
other one will succeed?
Concurrent updates of the same document will work correctly (one and
only one will succeed) thanks to the "optimistic locking" feature,
provided that you supply the correct version value in the update
(re-indexing) operation and that, if you use custom routing, you route
the updated document to the same shard as the original document (use
the same _routing value).
When updating I need to be able to find the document that has
to be updated without involving all shards, or else I will
not be able to scale in
number-of-possible-updates-per-time-unit - that is, I will
not be able to just buy more hardware to be able to support
more updates-per-time-unit, just as I expect to be able to
support more inserts-per-time-unit by buying more hardware.
When I want to update I know that only 0 or 1 document will
exist that matches the search criteria I will use to find
the document to be updated, and that the query will therefore
return a result set of size 0 or 1. In order not to involve
all shards for such queries, there needs to be some kind of
configuration (the same as the one controlling the
destination of a new document among shards) that ES is able
to take into consideration when performing the search: only
ask the one shard where it knows the document will exist if it
exists. What kind of solutions do you have in this area? Is this
possible? Only on the ids of the documents? Or?
Use routing to make sure that the "query" goes to the one shard where
you know that the document exists. If you are not using custom routing
you need to make sure that you "query" on type and _id (the default routing
Answer already received in
To update a document, you read document, make changes to
document, index document. Optimistic concurrency is supported
Comment to answer: OK, as I understand your answer there IS
such a concept as "update" (in RDBMS terminology) in ES. I
thought that indexing a document would always be considered
an "insert" (in RDBMS terminology). As I understand you, the
"index" operation in ES can be used for both "inserting" and
"updating". But that requires that ES is able to see whether a
document you try to index is a "new" document or an "updated"
version of an existing document. How does ES know if it is
one or the other by looking at the document?
If type and _id match an existing document in the index, the
"index" operation will be considered an "update" operation on that
---- Status ----
- Guess routing can be used to make sure that not all shards
will be contacted in order to find a specific document to be
Questions still awaiting answers:
- As I understand the answer above, the "index" operation can
be used for both "insert" and "update" (in RDBMS terminology)
of documents. How does ES know when a document sent for
indexing is a new version of an already existing document and
when it is actually a new document? Is it based on the value
of the id of the document, or the fact that a version field
exists, or ...?
It can check whether the document already exists and what its version
is when indexing the doc against the index (in a real-time manner).
You are still not answering the question - probably because the
answer is so obvious to you that it is not worth answering :-) My
question is about how ES knows whether a document I send for
indexing is a new document or if it is an updated version of an
existing document. Basically a document sent for indexing is just
some JSON sent over HTTP, there is nothing physical involved in
both the get-operation and the index-operation that tells ES that
the document sent for indexing is actually an updated version of
the document just retrieved using the get-operation. I just want
to know the set of fields that ES uses to find out that a document
sent to it is an updated version of an existing document. I guess
that the answer is that a document is a new version of an existing
document iff the type/_id corresponds to a document already
existing in the index.
I already answered that. The type/id ends up being the unique
identifier of a document within an index.
You didn't already answer that. But now you did. Thanks. I will
understand your answer this way: a document sent for indexing is
considered "the same as an existing document" iff it has the same values
for type and _id. Therefore an indexing operation will be considered an
"update" iff there already exists a document in the index with the same
values for type and _id as the document being indexed. This is not as
obvious as you might think.
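
For future readers, the rule just stated can be sketched with a toy in-memory model. This is not the ES client API: `ToyIndex`, its methods and the returned dictionaries are hypothetical names made up for illustration, and the version counter starting at 1 is an assumption of the sketch. It only shows that the (type, _id) pair decides whether an "index" operation behaves as an insert or as an update.

```python
# Toy in-memory model of "index = insert or update", keyed by (type, _id).
# Hypothetical names; this is NOT the Elasticsearch client API.

class ToyIndex:
    def __init__(self):
        # Maps (type, _id) -> (version, source)
        self._docs = {}

    def index(self, doc_type, doc_id, source):
        """Store source under (doc_type, doc_id) and report insert vs. update."""
        key = (doc_type, doc_id)
        existing = self._docs.get(key)
        # Assumption of this sketch: versions start at 1 and grow by 1.
        version = 1 if existing is None else existing[0] + 1
        self._docs[key] = (version, dict(source))
        return {"op": "insert" if existing is None else "update",
                "version": version}

    def get(self, doc_type, doc_id):
        version, source = self._docs[(doc_type, doc_id)]
        return version, dict(source)


index = ToyIndex()
first = index.index("tweet", "1", {"text": "hello"})
second = index.index("tweet", "1", {"text": "hello again"})  # same type/_id
other = index.index("user", "1", {"name": "per"})            # same _id, other type
```

In this model the second call replaces the first document because type and _id are equal, while the third call creates a separate document because the type differs.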
- As I understand the answer above, there is a
version-feature in ES enabling "optimistic locking" (if a
document has changed between the time it was read and the
time it is sent for re-indexing, the re-indexing operation
will fail). Is that true?
Yes. As long as you provide the version when indexing. A typical
scenario would be to "get" a document (hits a single shard), and
index / update the document while providing the version you have
from the "get" operation.
Again (as with one of my other questions) I guess this will only
work if I (as a programmer of apps operating against ES) make sure
to use the same routing value when I do the original indexing of a
document, and when I do the update-indexing of the same document.
You don't have to use custom routing value.
By default, the routing value is the id of the document (which you
have to provide when updating a document). If you do provide a custom
routing value, then you need to make sure to provide the same one when
you want to update the document.
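
A minimal sketch of why the same routing value must be used for the original indexing and for the update. It assumes the shard is chosen as a hash of the routing value modulo the number of shards; the CRC32 hash, the shard count of 5 and the function name `shard_for` are stand-ins chosen for illustration, not necessarily what ES does internally.

```python
import zlib

NUM_SHARDS = 5  # hypothetical shard count for this sketch


def shard_for(doc_id, routing=None, num_shards=NUM_SHARDS):
    """Pick a shard from the routing value, falling back to the document id.

    CRC32 is only a deterministic stand-in for whatever hash ES uses.
    """
    key = routing if routing is not None else doc_id
    return zlib.crc32(key.encode("utf-8")) % num_shards


# Default routing: the id alone decides the shard, so index and update agree.
shard_on_index = shard_for("42")
shard_on_update = shard_for("42")

# Custom routing: both operations must pass the same routing value.
custom_on_index = shard_for("42", routing="user-7")
custom_on_update = shard_for("42", routing="user-7")
```

With custom routing the document id no longer takes part in the shard choice, so an update sent without the routing value (or with a different one) may be directed to another shard than the one holding the original document.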
As stated somewhere else, I would state this clearly in the documentation
- Can you please provide me with a code example, first
indexing a new document, then finding that document again for
updating and re-indexing. Please include the "optimistic
locking" feature (enabled explicitly, if that is needed),
so that if the find/re-index is run concurrently in two
threads so that they both get to find/read before any of them
does re-index, then one of them will succeed and one of them
will fail. Thanks!
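
For future readers, here is a sketch of the scenario requested above, run against a toy in-memory model rather than a real cluster. `ToyIndex`, `VersionConflict` and `run_concurrent_updates` are hypothetical names, not the ES client API. The assumed semantics are the ones described in this thread: an index operation that carries a version succeeds only if that version still matches the stored one. A barrier forces both threads to finish reading before either one re-indexes.

```python
import threading


class VersionConflict(Exception):
    """Raised when the supplied version no longer matches the stored one."""


class ToyIndex:
    # Hypothetical in-memory stand-in; NOT the Elasticsearch client API.
    def __init__(self):
        self._docs = {}  # (type, _id) -> (version, source)
        self._lock = threading.Lock()

    def index(self, doc_type, doc_id, source, version=None):
        key = (doc_type, doc_id)
        with self._lock:  # the check and the write happen atomically
            current = self._docs.get(key)
            current_version = current[0] if current else 0
            if version is not None and version != current_version:
                raise VersionConflict(
                    "expected %s, found %s" % (version, current_version))
            new_version = current_version + 1
            self._docs[key] = (new_version, dict(source))
            return new_version

    def get(self, doc_type, doc_id):
        with self._lock:
            version, source = self._docs[(doc_type, doc_id)]
            return version, dict(source)


def run_concurrent_updates():
    index = ToyIndex()
    index.index("tweet", "1", {"text": "hello"})  # first insert, version 1
    barrier = threading.Barrier(2)
    outcomes = []
    outcomes_lock = threading.Lock()

    def worker(name):
        version, source = index.get("tweet", "1")
        barrier.wait()  # both threads have read before either re-indexes
        source["updated_by"] = name
        try:
            index.index("tweet", "1", source, version=version)
            result = "ok"
        except VersionConflict:
            result = "conflict"
        with outcomes_lock:
            outcomes.append(result)

    threads = [threading.Thread(target=worker, args=(n,)) for n in ("a", "b")]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(outcomes), index.get("tweet", "1")[0]
```

In this model exactly one of the two writers succeeds and the other gets a conflict, whichever thread is scheduled first. The losing writer would typically read the document again and decide whether to retry.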
Regards, Per Steffensen