Partial batch updates - indexed in order? - (0.90.5)


(george_monroe) #1

We are trying to index/update and refresh in batches using
BulkRequestBuilder to improve performance. It is very important for us to
execute our statements on the ES server in the same order as we build up
the BulkRequestBuilder object. This is because some statements are
create/index operations while others are partial updates to those same
documents. Can this be guaranteed?

If I build up a BulkRequestBuilder with the following statements, will they
be executed/indexed in the same order on the server?

Batch (1-100)
(1) index/create A
(2) index/create B
(3) update A
(4) update B
(5) index/create C

Thanks!

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(george_monroe) #2

Anyone? Can anyone confirm this concept in ES 0.90.5??



(Brian Yoder) #3

George,

I can't speak for the ES developers, but after loading and reloading update
streams of 100+ million documents using a single thread and the
BulkRequestBuilder, I have observed that the operations are applied in
exactly the order I supply them. This is important, since a delete followed
by an index would have very different results if applied in the reverse
order.

Brian



(george_monroe) #4

Brian!

Thanks so much for responding. It would make sense that it would be in
order, but then I started to think that the ES server would probably have
its own thread pool, which (depending on how it's implemented) would not
guarantee the same ordering.

Anyhow, thanks for adding confidence to the concept!

Cheers



(Jörg Prante) #5

No, there is no guarantee the documents appear in order on each involved
node in the index.

Batches are applied sequentially on each node, and in most cases this works
fine. But due to the distributed nature of ES, a synchronization of
document indexing across nodes does not take place.

Jörg



(george_monroe) #6

Jörg,

I think you misinterpreted my question. The concern is not about synchronizing data across nodes, but rather data correctness.

As Brian points out, if you built up the following batch of just two operations in the given order:

Batch (size 2)
(1) delete document with index id #A
(2) upsert document for index id #A

and sent it to one of the nodes using the ES transport client, then you want those operations applied to that document in exactly that order, because you get a very different result if the following order is applied instead:

(1) upsert document for index id #A
(2) delete document with index id #A

This is even before refreshing the index. Jörg, can you confirm that this order is preserved when the bulk is sent through the ES transport client?
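
To make the concern concrete, here is a toy sketch in plain Java. It does not use the Elasticsearch API at all; the class `OrderDemo` and its `apply` method are made up for illustration, and the "store" is just a map holding document A.

```java
import java.util.HashMap;
import java.util.Map;

// Toy in-memory model of a store holding document A; NOT the Elasticsearch API.
final class OrderDemo {
    // Applies "delete" and "upsert" operations on id A in the given order
    // and returns the resulting store.
    static Map<String, String> apply(String... ops) {
        Map<String, String> store = new HashMap<>();
        store.put("A", "old"); // assume document A already exists
        for (String op : ops) {
            if (op.equals("delete")) {
                store.remove("A");
            } else if (op.equals("upsert")) {
                store.put("A", "new");
            } else {
                throw new IllegalArgumentException("unknown op: " + op);
            }
        }
        return store;
    }

    public static void main(String[] args) {
        System.out.println(apply("delete", "upsert")); // {A=new}
        System.out.println(apply("upsert", "delete")); // {}
    }
}
```

With delete followed by upsert the document survives with the new content; with the reverse order it is gone.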

George



(Jörg Prante) #7

Yes, I understand. There are several aspects.

Bulks are just sequences of regular indexing operations, grouped to reduce
network round trips. In ES, several writes happen in parallel when indexing.

Within a node, in a shard, you have segments. All indexing - also deletes -
creates new data which is appended to existing files. Every once in a
while, Lucene starts to reorganize the segments with heavy I/O. The merging
of segments (the execution of inserts and deletes to get them readable by
subsequent reads) is controlled by a merge policy. ES uses multiple threads
for merging, so the segments should be sorted to get the index operations
applied in order. The default ES merge policy is tiered, not sorted, but
this can be controlled I think. Note there is no transaction control. There
is a SortMergePolicy, see

Across nodes, there is also a kind of parallel writing, which is even
harder to control, because you can have many clients sending data in
parallel to the same document, and nodes can be heavily loaded so that
indexing progresses at different speeds. Also, if you use replicas, you have
to guarantee that the primary advances its indexing in the same order as the
replicas. Since the indexing is distributed over several nodes, there must
be additional synchronization, such as vector clocks in the docs, so that
nodes can see the docs in the same order as the clients that sent them.
Each node has to reconstruct the correct order of the docs independently of
the others.

Jörg



(Brian Yoder) #8

Jörg,

Across nodes, there is also a kind of parallel writing, which is even
harder to control, because you can have many clients sending data in
parallel to the same document, and nodes can be heavily loaded so that
indexing progresses at different speeds. Also, if you use replicas, you have
to guarantee that the primary advances its indexing in the same order as the
replicas. Since the indexing is distributed over several nodes, there must
be additional synchronization, such as vector clocks in the docs, so that
nodes can see the docs in the same order as the clients that sent them.
Each node has to reconstruct the correct order of the docs independently of
the others.

I'm not exactly sure I understand the previous paragraph, but the bulk
loader I created using the BulkRequestBuilder is single-threaded. Running
only one instance at any given time removes any chance of the client sending
the updates out of order, so the question remains: Does ElasticSearch
process single-threaded bulk requests in the same order?

I fully understand that my client needs to be single-threaded. And our data
comes in such that bulk updates consist of a huge number of index and
delete operations. Sometimes, the data provider creates an update by
giving me a delete request followed by a create (index to ES) request. It
would be Very Bad if that was processed as an index followed by the delete.

Brian



(David Pilato) #9

Does ElasticSearch process single-threaded bulk requests in the same order?

IMHO it will process your request in the same order per shard.
That means that if you send an index operation on doc 1 and then a delete operation on doc 1, these operations will be sent to the shard in that order.

But if you send index doc1, delete doc1, index doc2 and delete doc2, you have no guarantee (if you have multiple shards and no custom routing) that index doc2 will be done after index doc1 or delete doc1.
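
A rough way to picture this, as a plain-Java toy model rather than anything taken from the ES code base: the bulk is split into one list per shard, and the relative order is kept inside each list. The class `ShardSplitDemo` and the hash in `shardFor` are assumptions made for illustration; Elasticsearch uses its own routing hash.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model of splitting one bulk request into per-shard operation lists.
// The hash below is a stand-in, NOT the function Elasticsearch uses.
final class ShardSplitDemo {
    static int shardFor(String docId, int shardCount) {
        return Math.floorMod(docId.hashCode(), shardCount);
    }

    // Each op is {action, docId}. Ops keep their relative order inside
    // every shard's list; nothing orders one shard's list against another's.
    static Map<Integer, List<String>> split(List<String[]> bulk, int shardCount) {
        Map<Integer, List<String>> perShard = new TreeMap<>();
        for (String[] op : bulk) {
            int shard = shardFor(op[1], shardCount);
            perShard.computeIfAbsent(shard, s -> new ArrayList<>())
                    .add(op[0] + " " + op[1]);
        }
        return perShard;
    }

    public static void main(String[] args) {
        List<String[]> bulk = Arrays.asList(
                new String[] {"index", "doc1"},
                new String[] {"delete", "doc1"},
                new String[] {"index", "doc2"},
                new String[] {"delete", "doc2"});
        // Prints one line per shard that received operations.
        split(bulk, 5).forEach((shard, ops) ->
                System.out.println("shard " + shard + ": " + ops));
    }
}
```

Operations on the same id always land in the same list, so index doc1 stays ahead of delete doc1. The lists for different shards are processed independently, which is why nothing can be said about doc2 relative to doc1.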

Make sense?

--
David :wink:
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs



(Jörg Prante) #10

Brian, bulk indexing has several stages: the bulk request construction at
the client, the submission to the cluster, and the distribution of the doc
operations to the nodes that hold the shards.

It is easy to see that if you only have one client, one node, one shard,
and no replicas, the sequence of operations in a bulk request can be
maintained.

But just think of two (single-threaded) clients that do not use MVCC
(document versioning) and submit exactly the same bulk indexing sequence of
docs, with mixed insertions and deletions, at exactly the same local time.
They cannot be sure that the operations were executed in the sequence they
submitted to the cluster.

Jörg



(george_monroe) #11

Jörg,

Thanks for responding. You seem to have a detailed grasp on the inner workings. But we need to go a little deeper. Here is the simple architecture we want to go with in Production:

  • 2 nodes
  • 5 shards per node
  • 1 single threaded client that reads messages off of a JMS queue and sends them in bulks to the ES cluster using the ES Transport Client

The bulk is built up using BulkRequestBuilder with creates/updates/deletes in the order that we need. The bulk is sent using the transport client, after which each bulk is refreshed to make it searchable. The bulk response comes back with individual responses, which the API doc says signify the order in which they are applied on the server.

The simple question we are trying to answer is: Given this architecture setup with only one single-threaded client, can we have data loss because the individual statements in the bulk might not be applied to the documents in the same order that we built it up?

We kind of need a definitive answer because we stand the chance of losing data. If the answer is yes (no guarantee and the statements may be jumbled) then I believe this fact is significant and needs to be stated in the documentation.

George



(David Pilato) #12

How does it work behind the scenes?

The bulk is read (streamed) and, using the id (or the routing value, if any), each operation is routed immediately to the right shard.

That basically means that if you index doc1, then update it, and then delete it, all operations will be executed in this order at the shard level.

Jörg explained that if you send two concurrent bulks (and the same applies to individual requests), then you don't have a guarantee. The latest executed operation will win!

So, as you described it, I don't see any problem for your use case.

My 2 cents.

--
David :wink:
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs



(george_monroe) #13

Thanks David for the simple and clear explanation! I truly hope this is so. For concurrent bulks it makes sense that there would not be guarantees.

Thanks again!!



(Brian Yoder) #14

Thanks, David, for the clear and concise response, and thanks Jörg for the
cool low-level details behind it.

Additionally, I fully understand that multiple simultaneous bulk load
clients issuing requests against the same document ID is not deterministic.
And I also fully understand that the indexing and on-disk / unsorted
response ordering across multiple document IDs is not deterministic across
bulk loads. Of course, that's life in the non-transactional NoSQL world.
Indeed, these are some of the reasons that I like ES so much and why it's
so fast... and when applications are designed with this in mind they fly
past transactional RDB solutions. Yay!

Thanks again!

Brian



(David Pilato) #15

If you need to guarantee the order, then you should look at the _version field. But you have to manage conflicts at the client level.
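
As a minimal sketch of the idea, assuming the client supplies its own increasing version number with every write: a write is accepted only if its version is newer than the stored one, and a rejected write is reported back so the client can decide what to do. The class `VersionedStoreDemo` is a hypothetical in-memory model and does not call the Elasticsearch API.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of version-checked writes. It imitates the idea of
// client-supplied versions but is NOT the Elasticsearch API.
final class VersionedStoreDemo {
    private final Map<String, Long> versions = new HashMap<>();
    private final Map<String, String> values = new HashMap<>();

    // Accepts the write only if its version is newer than the stored one.
    // Returns false on a conflict, which the client must handle itself.
    boolean index(String docId, String value, long version) {
        Long current = versions.get(docId);
        if (current != null && version <= current) {
            return false; // stale write: rejected instead of overwriting
        }
        versions.put(docId, version);
        values.put(docId, value);
        return true;
    }

    String get(String docId) {
        return values.get(docId);
    }

    public static void main(String[] args) {
        VersionedStoreDemo store = new VersionedStoreDemo();
        // Two writers race; the newer version happens to arrive first.
        System.out.println(store.index("doc1", "newer", 2)); // true
        System.out.println(store.index("doc1", "older", 1)); // false
        System.out.println(store.get("doc1"));               // newer
    }
}
```

Without the version check the late-arriving "older" write would simply win; with it, the result no longer depends on arrival order, at the cost of handling the rejected write in the client.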

--
David :wink:
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs


