Batch submission?


(Colin Surprenant) #1

Hi,

In the Riak mailing list
(http://lists.basho.com/pipermail/riak-users_lists.basho.com/2010-April/000927.html),
Eric Gaumer made the excellent suggestion of adding a batch submission
endpoint in ES to avoid the HTTP overhead when dealing with a very large
number of documents to submit.

Is this something that could be easily added in ES? What do you think?

Thanks,
Colin
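For context, the endpoint requested here was eventually added to Elasticsearch as the _bulk API, which takes newline-delimited JSON: an action line followed by the document source. A minimal sketch of building such a payload in Python; the index and field names are illustrative, and the action format shown is the one the API settled on later, not anything specified in this thread:

```python
import json

def build_bulk_body(index, docs):
    """Build a newline-delimited JSON bulk payload: one action line
    plus one source line per document, terminated by a final newline."""
    lines = []
    for doc_id, source in docs:
        # The action line tells the server what to do with the next line.
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"

body = build_bulk_body("articles", [("1", {"title": "hello"}),
                                    ("2", {"title": "world"})])
```

A single POST of a body like this indexes both documents in one round trip instead of two.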


(Shay Banon) #2

Batch submission can be added, but note that it will not be transactional
(either all succeed or all fail). Also, instead of using batch submission,
you can multithread or issue each operation asynchronously. You should get
very similar results to batching when you do it.

cheers,
shay.banon
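Shay's alternative (multithreading or issuing each single-document operation asynchronously) can be sketched with a thread pool. Here `submit_one` is a hypothetical stand-in for the HTTP call that indexes one document, not a real client API:

```python
from concurrent.futures import ThreadPoolExecutor

def submit_one(doc):
    # Stand-in for an HTTP request that indexes a single document;
    # returns a per-document status the way the real call would.
    return {"id": doc["id"], "ok": True}

def submit_all(docs, workers=8):
    # Overlap request latency by issuing submissions concurrently;
    # map() preserves the input order of the results.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(submit_one, docs))

results = submit_all([{"id": i} for i in range(100)])
```

With enough in-flight requests, concurrency hides per-request latency, which is the effect Shay is comparing to batching.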



(egaumer) #3

On Fri, Apr 16, 2010 at 11:38 AM, Shay Banon
shay.banon@elasticsearch.com wrote:

Batch submission can be added, but note that it will not be transactional
(either all succeed or all fail). Also, instead of using batch submission,
you can multithread or issue each operation asynchronously. You should get
very similar results to batching when you do it.

In a multithreaded situation (async or otherwise), you still have to deal
with message-passing semantics (network latency, etc.). I would argue that
passing batches of 1000 documents in a single thread would still be faster
than spawning 1000 threads that all submit a single document. Am I wrong?
Maybe at small batch sizes they are pretty equal, but what about as the
batch size increases?

I guess I'm mainly focused on the HTTP interface and the overhead associated
with this type of messaging. Batching seems like a reasonable way to reduce
latency in this particular area but could very well create bottlenecks
elsewhere (i.e., index writing).

Even still, if multithreading is an option, then wouldn't sending batches
across each of those threads be more efficient than sending one document at
a time?

So assume I have 100 million documents of 3K each and I need to use HTTP.
I plan on using 20 threads per node on a 3-node feeding cluster (BTW, this
is, without a doubt, a common scenario in enterprise search deployments).

Being able to send a batch of a few hundred documents across each connection
is going to save me a lot of HTTP calls. No?
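The savings Eric is pointing at are easy to put numbers on: at 100 million documents, batches of 200 cut the HTTP request count by a factor of 200. A quick check:

```python
docs = 100_000_000
batch_size = 200

single_calls = docs                     # one HTTP request per document
batched_calls = -(-docs // batch_size)  # ceiling division: requests with batching
saving_factor = single_calls // batched_calls
```

That is 500,000 requests instead of 100 million; per-request overhead (headers, connection handling) shrinks by the same factor.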

I think the transaction semantics are reasonable. If I send a batch of 200
documents, I would expect the batch to fail or succeed as one unit;
otherwise it's much harder for me to resubmit. This is generally how some
of the commercial vendors do it.

Regards,
-Eric


(Shay Banon) #4

Batching will certainly increase the number of documents you can index. If
you use HTTP with keep-alive, the overhead of sending one document at a
time should not be that high. But, of course, it depends on a lot of
factors. In Java the HTTP aspect (the headers and such) does not add a lot
of overhead compared to the latency of the rest of the request if you do it
right, but I am not sure how much overhead HTTP adds in Ruby and others...

I will add batching, and people can play with it and see if they can get
better performance.

Regarding all-succeed-or-all-fail semantics: I was saying that elasticsearch
will not support this. If you do batching, the request will hit several
shards, and elasticsearch will not do a two-phase commit across potentially
many resources (shards), especially since two-phase commit is by itself
broken (but that's a different story) when it comes to many resources. The
API will simply return a status for each element in the batch, i.e., whether
it worked or not.

cheers,
shay.banon
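Shay's per-element status model leaves retries to the client. A sketch of splitting per-item results into successes and documents to resubmit; the item shape here (an id plus an ok flag) is an assumption for illustration, not the actual response format:

```python
def split_results(items):
    """Partition per-item batch results into successes and failures.

    Each item is assumed to carry an 'ok' flag and the document id, so
    failed documents can be collected and resubmitted as a new batch.
    """
    ok, failed = [], []
    for item in items:
        (ok if item.get("ok") else failed).append(item["id"])
    return ok, failed

ok, failed = split_results([
    {"id": "1", "ok": True},
    {"id": "2", "ok": False},
    {"id": "3", "ok": True},
])
```

Only the failed ids need to go into the retry batch, which is what makes per-item statuses workable without transactions.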



(egaumer) #5

On Fri, Apr 16, 2010 at 12:32 PM, Shay Banon
shay.banon@elasticsearch.com wrote:

Regarding all-succeed-or-all-fail semantics: I was saying that elasticsearch
will not support this. If you do batching, the request will hit several
shards, and elasticsearch will not do a two-phase commit across potentially
many resources (shards), especially since two-phase commit is by itself
broken (but that's a different story) when it comes to many resources. The
API will simply return a status for each element in the batch, i.e., whether
it worked or not.

Ahh... got ya. I think as long as people understand the limitations, it's
not ideal but okay. I think this functionality would be used mainly to
bootstrap an index with some pre-existing data. Once that process is
complete, you'd typically switch to an incremental or near-real-time feed
anyway (at least that's what I'd suggest).

-Eric


(Shay Banon) #6

Sounds great! Want to open an issue for the batch thingy?

cheers,
shay.banon



(egaumer) #7

On Fri, Apr 16, 2010 at 12:53 PM, Shay Banon
shay.banon@elasticsearch.com wrote:

Sounds great! Want to open an issue for the batch thingy?

http://github.com/elasticsearch/elasticsearch/issues#issue/138

-Eric


(Colin Surprenant) #8

Very nice, thanks. I will definitely run some performance tests when it's
available.

I agree with Eric that the typical use case for this would be bootstrapping
an index with some pre-existing data. This is how I plan to use it.

Having to parse the results to check the status for each element works
for me.

Colin

On Apr 16, 2:04 pm, Eric Gaumer egau...@gmail.com wrote:

On Fri, Apr 16, 2010 at 12:53 PM, Shay Banon
shay.ba...@elasticsearch.com wrote:

Sounds great! Want to open an issue for the batch thingy?

http://github.com/elasticsearch/elasticsearch/issues#issue/138

-Eric

