Transactional ACID features in ES

Extracted from
http://groups.google.com/group/elasticsearch/browse_thread/thread/cbd2cc71c407e435
and modified to be more concrete

What kind of transactional ACID features does ES support? It would be
nice to have all the ACID properties in a transaction spanning the
entire work of an indexing process (some code that I will write doing a
number of index operations against ES). I will not bother you with the
Isolation and Durability aspects here. But I will bother you with the
Atomicity and Consistency aspects.

Atomicity. If I have an indexing process (doing bulk
(http://www.elasticsearch.org/guide/reference/java-api/bulk.html)
indexing of many documents), do I get the Atomicity feature of ACID
transactions? Put another way: do I end up in a state where "all the
documents or none of the documents have been indexed" when I call
"execute"? I guess not, since the example on
http://www.elasticsearch.org/guide/reference/java-api/bulk.html has a
comment "process failures by iterating through each bulk response item",
indicating that I will get detailed information back about which
documents were successfully indexed and which were not. Is that correct?

Consistency. I know of at least 3 features in ES that will require special
attention in the ES implementation in order to also work when many
processes work with documents concurrently:
a) Making sure that there is never a violation of the unique constraint
on type/_id of documents in an index. Will the unique constraint
implementation on type/_id work correctly if many concurrent processes
try to index new documents with the same values for type and _id? Also if
the different processes use routing, so that the new documents with the
same values for type and _id are actually not routed to the same shard
(and therefore potentially not the same node)? How well has this been
tested?
b) Making sure that the "optimistic locking"
(http://www.elasticsearch.org/blog/2011/02/08/versioning.html)
implemented around updating (re-indexing) of documents works. Will the
"optimistic locking" work correctly if many concurrent processes try to
update an existing document concurrently? Put another way, is it
guaranteed, if 100 processes try to update an existing document in the
same split second, that one and only one of those processes will succeed
and the other 99 processes will fail (with HTTP error code 409)? How
well has this been tested?
c) Same as b) above, but with deleting instead of updating (re-indexing)
and with HTTP error code 404 instead of 409.

Regards, Per Steffensen

On Mon, Sep 12, 2011 at 3:55 PM, Per Steffensen steff@designware.dk wrote:

Atomicity. If I have an indexing process (doing bulk
(http://www.elasticsearch.org/guide/reference/java-api/bulk.html)
indexing of many documents), do I get the Atomicity feature of ACID
transactions? Put another way: do I end up in a state where "all the
documents or none of the documents have been indexed" when I call "execute"?
I guess not, since the example on
http://www.elasticsearch.org/guide/reference/java-api/bulk.html has a
comment "process failures by iterating through each bulk response
item", indicating that I will get detailed information back about which
documents were successfully indexed and which were not. Is that correct?

Yes. Atomicity is per document.

Consistency. I know of at least 3 features in ES that will require special
attention in the ES implementation in order to also work when many
processes work with documents concurrently:
a) Making sure that there is never a violation of the unique constraint on
type/_id of documents in an index. Will the unique constraint implementation
on type/_id work correctly if many concurrent processes try to index new
documents with the same values for type and _id? Also if the different
processes use routing, so that the new documents with the same values for
type and _id are actually not routed to the same shard (and therefore
potentially not the same node)? How well has this been tested?

It handles concurrent updates.

b) Making sure that the "optimistic locking"
(http://www.elasticsearch.org/blog/2011/02/08/versioning.html)
implemented around updating (re-indexing) of documents works. Will the
"optimistic locking" work correctly if many concurrent processes try to
update an existing document concurrently? Put another way, is it
guaranteed, if 100 processes try to update an existing document in the
same split second, that one and only one of those processes will succeed and
the other 99 processes will fail (with HTTP error code 409)? How well has
this been tested?

Yes.

c) Same as b) above, but with deleting instead of updating (re-indexing)
and with HTTP error code 404 instead of 409.

Yes.

Regards, Per Steffensen

I will provide the answers to my questions below, as I understand
them, in order to make it easy for future readers of this thread to
find them here.

Shay Banon wrote:

On Mon, Sep 12, 2011 at 3:55 PM, Per Steffensen <steff@designware.dk> wrote:

Atomicity. If I have an indexing process (doing bulk
(http://www.elasticsearch.org/guide/reference/java-api/bulk.html)
indexing of many documents), do I get the Atomicity feature of ACID
transactions? Put another way: do I end up in a state where "all
the documents or none of the documents have been indexed" when I
call "execute"?

No, some index operations might have succeeded and some might have
failed. The BulkResponse will reveal which.

I guess not, since the example on
http://www.elasticsearch.org/guide/reference/java-api/bulk.html
has a comment "process failures by iterating through each bulk
response item", indicating that I will get detailed information
back about which documents were successfully indexed and which
were not. Is that correct?

Yes. Atomicity is per document.

Thanks. I will look into the structure of the BulkResponse class in order to
figure out exactly how to find out which documents were successfully
indexed and which were not.
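
A minimal sketch of the per-item check discussed above, written in Python rather than against the Java API. It assumes a response shaped like the JSON returned by the REST bulk API in recent Elasticsearch versions (an "items" list in which each entry is keyed by its action and carries a "status" and, on failure, an "error"); the 2011 response format and the Java BulkResponse class may well differ. The function name split_bulk_items and the sample data are made up for illustration.

```python
def split_bulk_items(response):
    """Return (succeeded_ids, failed) for a bulk-response-shaped dict.

    Assumption: each entry in response["items"] is a one-key dict such as
    {"index": {"_id": "1", "status": 201}}; failures carry an "error" field.
    """
    succeeded, failed = [], {}
    for item in response.get("items", []):
        # Each item has exactly one key: the action name (index, create, ...).
        (_action, result), = item.items()
        if "error" in result or result.get("status", 200) >= 300:
            failed[result["_id"]] = result.get("error", "unknown error")
        else:
            succeeded.append(result["_id"])
    return succeeded, failed


# Hypothetical response: document 2 failed while 1 and 3 were indexed,
# which is the "some but not all" outcome a bulk request can produce.
sample = {
    "errors": True,
    "items": [
        {"index": {"_id": "1", "status": 201}},
        {"index": {"_id": "2", "status": 409, "error": "version conflict"}},
        {"index": {"_id": "3", "status": 201}},
    ],
}
ok_ids, failures = split_bulk_items(sample)
```

Since atomicity is per document, a caller that needs all-or-nothing behaviour would have to retry or compensate for the entries in failures itself.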

Consistency. I know of at least 3 features in ES that will require
special attention in the ES implementation in order to also work
when many processes work with documents concurrently:
a) Making sure that there is never a violation of the unique
constraint on type/_id of documents in an index. Will the unique
constraint implementation on type/_id work correctly if many
concurrent processes try to index new documents with the same
values for type and _id?

Will assume that the answer "It handles concurrent updates." is for this
part of the question.

Also if the different processes use custom routing, so that the
new documents with the same values on type and _id, are actually
not routed to the same shard (and therefore potentially not the
same node)?

That will NOT work. That is - the unique constraint on type/_id will NOT
be enforced if two processes try to index two different documents with
the same values for type and _id, if custom routing is used and the two
documents happen to be routed to two different shards/nodes. This
problem exists independently of whether or not the two processes work
concurrently or separated in time. I would state this problem as a
warning in bold in the documentation about routing.
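
The routing problem described above can be illustrated with a toy model. This is only a sketch under stated assumptions: ToyIndex, shard_for and NUM_SHARDS are hypothetical names, each shard is a plain dict, and crc32 stands in for whatever hash function Elasticsearch really uses. The point is only that each shard checks type/_id against its own contents.

```python
import zlib

NUM_SHARDS = 5  # hypothetical shard count


def shard_for(routing_value):
    # Stand-in hash: crc32 is used only to get a deterministic number here;
    # it is not the hash function Elasticsearch uses.
    return zlib.crc32(routing_value.encode("utf-8")) % NUM_SHARDS


class ToyIndex:
    """Each shard keeps its own (type, _id) map and knows nothing of the others."""

    def __init__(self):
        self.shards = [dict() for _ in range(NUM_SHARDS)]

    def index(self, doc_type, doc_id, source, routing=None):
        # Without custom routing, the _id itself decides the shard.
        key = routing if routing is not None else doc_id
        shard = self.shards[shard_for(key)]
        shard[(doc_type, doc_id)] = source  # overwrites only within this shard

    def copies(self, doc_type, doc_id):
        return sum((doc_type, doc_id) in shard for shard in self.shards)


# Pick two routing values that land on different shards.
routing_a = "user-0"
routing_b = next(r for r in (f"user-{i}" for i in range(1, 100))
                 if shard_for(r) != shard_for(routing_a))

routed = ToyIndex()
routed.index("tweet", "1", {"text": "first"}, routing=routing_a)
routed.index("tweet", "1", {"text": "second"}, routing=routing_b)
duplicates = routed.copies("tweet", "1")

unrouted = ToyIndex()
unrouted.index("tweet", "1", {"text": "first"})
unrouted.index("tweet", "1", {"text": "second"})
single = unrouted.copies("tweet", "1")
```

In this model the routed index ends up holding two documents with the same type/_id, while the index without custom routing holds one, whether the two writes happen concurrently or far apart in time.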

How well has this been tested?

No answer. Will test this myself.

It handles concurrent updates.

b) Making sure that the "optimistic locking"
(http://www.elasticsearch.org/blog/2011/02/08/versioning.html)
implemented around updating (re-indexing) of documents works. Will
the "optimistic locking" work correctly if many concurrent
processes try to update an existing document concurrently?

Will assume that the answer "Yes" is for this part of the question
(since it is the only question that can be answered with a simple yes).
But the "yes" is with some "but only if you make sure to ..." - see
http://groups.google.com/group/elasticsearch/browse_thread/thread/eed4dc3606a031ed

Put another way, is it guaranteed, if 100 processes try to update
an existing document in the same split second, that one and only
one of those processes will succeed and the other 99 processes
will fail (with HTTP error code 409)? How well has this been tested?

No answer. Will trust that it works, but will probably test this a
little bit myself.

Yes.

c) Same as b) above, but with deleting instead of updating
(re-indexing) and with HTTP error code 404 instead of 409.

Yes.

Regards, Per Steffensen

On Tue, Sep 13, 2011 at 12:03 PM, Per Steffensen steff@designware.dk wrote:

I will provide the answers to my questions below, as I understand
them, in order to make it easy for future readers of this thread to
find them here.

Shay Banon wrote:

On Mon, Sep 12, 2011 at 3:55 PM, Per Steffensen steff@designware.dk wrote:

Atomicity. If I have an indexing process (doing bulk
(http://www.elasticsearch.org/guide/reference/java-api/bulk.html) indexing
of many documents), do I get the Atomicity feature of ACID transactions? Put
another way: do I end up in a state where "all the documents or none of the
documents have been indexed" when I call "execute"?

No, some index operations might have succeeded and some might have
failed. The BulkResponse will reveal which.

The answer is provided below, by stating that the atomicity is per document.

I guess not, since the example on
http://www.elasticsearch.org/guide/reference/java-api/bulk.html has a
comment "process failures by iterating through each bulk response item",
indicating that I will get detailed information back about which documents
were successfully indexed and which were not. Is that correct?

Yes. Atomicity is per document.

Thanks. I will look into the structure of the BulkResponse class in order to
figure out exactly how to find out which documents were successfully
indexed and which were not.

I hope you have a PhD; otherwise, it's hard to read the javadoc / read the
API doc on the site.

Consistency. I know of at least 3 features in ES that will require special
attention in the ES implementation in order to also work when many
processes work with documents concurrently:
a) Making sure that there is never a violation of the unique constraint
on type/_id of documents in an index. Will the unique constraint
implementation on type/_id work correctly if many concurrent processes try
to index new documents with the same values for type and _id?

Will assume that the answer "It handles concurrent updates." is for this
part of the question.

Covered in the Yes at the bottom; you just ask the same question 3 times in 3
different places.

Also if the different processes use custom routing, so that the new
documents with the same values for type and _id are actually not routed to
the same shard (and therefore potentially not the same node)?

That will NOT work. That is - the unique constraint on type/_id will NOT
be enforced if two processes try to index two different documents with the
same values for type and _id, if custom routing is used and the two
documents happen to be routed to two different shards/nodes. This problem
exists independently of whether or not the two processes work concurrently
or separated in time. I would state this problem as a warning in bold in the
documentation about routing.

As explained before.

How well has this been tested?

No answer. Will test this myself.

What type of answer are you expecting here? There are unit tests,
integration tests, and obviously users who use the feature and projects in
production. But go ahead, test it yourself.

It handles concurrent updates.

b) Making sure that the "optimistic locking"
(http://www.elasticsearch.org/blog/2011/02/08/versioning.html) implemented
around updating (re-indexing) of documents works. Will the "optimistic
locking" work correctly if many concurrent processes try to update an
existing document concurrently?

Will assume that the answer "Yes" is for this part of the question
(since it is the only question that can be answered with a simple yes). But
the "yes" is with some "but only if you make sure to ..." - see
http://groups.google.com/group/elasticsearch/browse_thread/thread/eed4dc3606a031ed

It's covered in the Yes below, you just ask it several times.

Put another way, is it guaranteed, if 100 processes try to update an
existing document in the same split second, that one and only one of
those processes will succeed and the other 99 processes will fail (with HTTP
error code 409)? How well has this been tested?

No answer. Will trust that it works, but will probably test this a
little bit myself.

See the answer to the other question on testing.

Yes.

c) Same as b) above, but with deleting instead of updating (re-indexing)
and with HTTP error code 404 instead of 409.

Yes.

Regards, Per Steffensen

Shay Banon wrote:

On Tue, Sep 13, 2011 at 12:03 PM, Per Steffensen <steff@designware.dk> wrote:

I will provide the answers to my questions below, as I understand
them, in order to make it easy for future readers of this thread to
find them here.

Shay Banon wrote:
On Mon, Sep 12, 2011 at 3:55 PM, Per Steffensen <steff@designware.dk> wrote:

    Atomicity. If I have an indexing process (doing bulk
    (http://www.elasticsearch.org/guide/reference/java-api/bulk.html)
    indexing of many documents), do I get the Atomicity feature of
    ACID transactions? Put another way: do I end up in a state
    where "all the documents or none of the documents have been
    indexed" when I call "execute"?

No, some index operations might have succeeded and some might have
failed. The BulkResponse will reveal which.

The answer is provided below, by stating that the atomicity is per
document.

Yes, I know. I just provided the answer here so that other readers
can find it where it should have been.

    I guess not, since the example on
    http://www.elasticsearch.org/guide/reference/java-api/bulk.html
    has a comment "process failures by iterating through each
    bulk response item", indicating that I will get detailed
    information back about which documents were successfully
    indexed and which were not. Is that correct?

Yes. Atomicity is per document.

Thanks. I will look into the structure of the BulkResponse class in
order to figure out exactly how to find out which documents were
successfully indexed and which were not.

I hope you have a PhD; otherwise, it's hard to read the javadoc / read
the API doc on the site.

I'm sure I will manage.

    Consistency. I know of at least 3 features in ES that will
    require special attention in the ES implementation in order
    to also work when many processes work with documents
    concurrently:
    a) Making sure that there is never a violation of the unique
    constraint on type/_id of documents in an index. Will the
    unique constraint implementation on type/_id work correctly if
    many concurrent processes try to index new documents with the
    same values for type and _id?

Will assume that the answer "It handles concurrent updates." is
for this part of the question.

Covered in the Yes at the bottom; you just ask the same question 3 times
in 3 different places.

No, I asked 3 different questions: one about consistency when creating
NEW documents (with the same type/_id) concurrently, one about
updating existing documents concurrently, and one about deleting
existing documents concurrently. The consistency semantics in those 3
areas might have been different, and they actually turned out to be,
a little bit.

The first part of the "NEW documents" question here was about whether it
was handled, assuming that all "NEW documents" (with the same
type/_id) were routed (e.g. by not using custom routing) to the same shard.

    Also if the different processes use custom routing, so that
    the new documents with the same values on type and _id, are
    actually not routed to the same shard (and therefore
    potentially not the same node)?

This was a follow-up question, focusing explicitly on the case where the
"NEW documents" (with the same type/_id) were NOT routed to the same
shard (using custom routing). It turned out to be a justified separation of
the overall question about "NEW documents", because the answer is
actually different for the two cases: it works when "NEW
documents" (with the same type/_id) are routed to the same shard, and it
does NOT work if they are not routed to the same shard.

That will NOT work. That is - the unique constraint on type/_id
will NOT be enforced if two processes try to index two different
documents with the same values for type and _id, if custom routing
is used and the two documents happen to be routed to two different
shards/nodes. This problem exists independently of whether or not
the two processes work concurrently or separated in time. I would
state this problem as a warning in bold in the documentation about
routing.

As explained before.

    How well has this been tested?

No answer. Will test this myself.

What type of answer are you expecting here? There are unit tests,
integration tests, and obviously users who use the feature and
projects in production.

That would have been a nice answer, yes, but it will not prevent me
from testing it myself, since unit tests and integration tests
seldom focus on having many threads do things that potentially interfere
with each other's work. So stating that you have unit tests and
integration tests does not actually make me confident that this kind of
potential race-condition problem has been tested and anchored in
automated continuous tests.

But go ahead, test it yourself.

I will, unless you tell me (or I read in the code later) that you have
tests that make many threads try to do interfering things like this in
parallel, tests that assert that one and only one succeeds.
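
The kind of test described above could look roughly like the following sketch. It runs against a toy in-memory document, not against Elasticsearch, so it says nothing about how ES itself behaves or is tested; ToyDocument, VersionConflict and race are hypothetical names, and VersionConflict merely plays the role of the HTTP 409 response.

```python
import threading


class VersionConflict(Exception):
    """Plays the role of the HTTP 409 response in this toy model."""


class ToyDocument:
    def __init__(self, source):
        self.source = source
        self.version = 1
        self._lock = threading.Lock()

    def update(self, source, expected_version):
        # The compare and the write happen under one lock, so they are atomic.
        with self._lock:
            if self.version != expected_version:
                raise VersionConflict(self.version)
            self.source = source
            self.version += 1
            return self.version


def race(doc, writers=100):
    seen_version = doc.version          # every writer read the same version
    barrier = threading.Barrier(writers)
    outcomes = []                       # list.append is thread-safe in CPython

    def attempt(n):
        barrier.wait()                  # release all writers at once
        try:
            doc.update({"writer": n}, expected_version=seen_version)
            outcomes.append("ok")
        except VersionConflict:
            outcomes.append("conflict")

    threads = [threading.Thread(target=attempt, args=(n,)) for n in range(writers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outcomes


results = race(ToyDocument({"writer": None}))
```

Against a real cluster, the same harness would presumably replace doc.update with an index request that carries the version each writer read, and count the conflict responses instead of the exceptions.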

It handles concurrent updates.
 

    b) Making sure that the "optimistic locking"
    (http://www.elasticsearch.org/blog/2011/02/08/versioning.html)
    implemented around updating (re-indexing) of documents works.
    Will the "optimistic locking" work correctly if many
    concurrent processes try to update an existing document
    concurrently?
Will assume that the answer "Yes" is for this part of the question
(since it is the only question that can be answered with a simple
yes). But the "yes" is with some "but only if you make sure to
..." - see
http://groups.google.com/group/elasticsearch/browse_thread/thread/eed4dc3606a031ed

It's covered in the Yes below, you just ask it several times.

You are right. Or at least I elaborated on the question in the second
sentence. Sorry.

    Put another way, is it guaranteed, if 100 processes try to
    update an existing document in the same split second, that one
    and only one of those processes will succeed and the other 99
    processes will fail (with HTTP error code 409)? How well has
    this been tested?

No answer. Will trust that it works, but will probably test this a
little bit myself.

See the answer to the other question on testing.

Yes.
 

    c) Same as b) above, but with deleting instead of updating
    (re-indexing) and with HTTP error code 404 instead of 409.


Yes.
 


    Regards, Per Steffensen