Batching will certainly increase the number of documents you can index. If
you use HTTP with keep-alive, the overhead of sending one document at a
time should not be that high, but of course it depends on a lot of
factors. In Java the HTTP aspect (the headers and such) does not add much
overhead compared to the latency of the rest of the request if you do it
right, but I am not sure how much HTTP overhead you have in Ruby and
others...
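To put a rough number on that header overhead, here is a small sketch. The request line and headers below are an assumption of what a typical single-document index request might look like, not elasticsearch's actual wire format; the point is just how small the fixed HTTP cost is next to a 3 KB document body.

```python
# Rough illustration of per-request HTTP overhead when indexing one
# document per call over a keep-alive connection. The header text is a
# hypothetical example, not elasticsearch's actual request format.
DOC_SIZE = 3 * 1024  # a 3 KB document body

headers = (
    "PUT /index/type/1 HTTP/1.1\r\n"
    "Host: localhost:9200\r\n"
    "Content-Type: application/json\r\n"
    f"Content-Length: {DOC_SIZE}\r\n"
    "Connection: keep-alive\r\n"
    "\r\n"
)

# Fraction of each request's bytes that is header rather than document.
overhead = len(headers) / (len(headers) + DOC_SIZE)
print(f"header bytes: {len(headers)}, overhead: {overhead:.1%}")
```

With keep-alive there is no per-request connection setup, so the headers are essentially the whole fixed cost — a few percent per request in this sketch.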
I will add batching, and people can play with it and see if they can get
better performance.
Regarding whether all will fail or not: I was saying that elasticsearch will
not support this. If you do batching, the request will hit several shards,
and elasticsearch will not do a two-phase commit across potentially many
resources (shards), especially since two-phase commit is, by itself, broken
when it comes to many resources (but that's a different story). The API will
simply return a status for each element in the batch, i.e., whether it worked or
not.
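In client code, per-item statuses mean you can pick out just the failures and resubmit those. A minimal sketch — the response shape (a list of `{"id": ..., "ok": ...}` entries) is a hypothetical illustration, not a committed API:

```python
# Sketch of consuming a per-item batch response. Each item carries its
# own status, so the client resubmits only the failures instead of the
# whole batch. The response structure here is hypothetical.
response = [
    {"id": "1", "ok": True},
    {"id": "2", "ok": False, "error": "some failure"},
    {"id": "3", "ok": True},
]

failed = [item["id"] for item in response if not item["ok"]]
print("resubmit:", failed)
```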
cheers,
shay.banon
On Fri, Apr 16, 2010 at 7:21 PM, Eric Gaumer egaumer@gmail.com wrote:
On Fri, Apr 16, 2010 at 11:38 AM, Shay Banon <shay.banon@elasticsearch.com>
wrote:
Batch submission can be added, but first, note that batch submission will
not be transactional (either all succeed or all fail). Also, instead of
using batch submission, you can either multithread or async each actual
operation you want to do. You should get very similar results to batching
when you do it.
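The multithreaded alternative might look like the sketch below. `index_doc` is a stand-in for a real per-document HTTP call to elasticsearch; here it just records the document so the sketch runs without a server.

```python
# Sketch of multithreading individual index operations instead of
# batching: a pool of workers each submits one document at a time.
from concurrent.futures import ThreadPoolExecutor

indexed = []

def index_doc(doc):
    # In real code this would be one HTTP PUT per document over a
    # keep-alive connection; here we just record it (list.append is
    # thread-safe in CPython).
    indexed.append(doc["id"])
    return True

docs = [{"id": i, "body": "..."} for i in range(100)]

with ThreadPoolExecutor(max_workers=20) as pool:
    results = list(pool.map(index_doc, docs))

print(f"indexed {len(indexed)} docs, all ok: {all(results)}")
```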
In a multithreaded situation (async or otherwise), you still have to deal
with message-passing (network latency, etc.) semantics. I would argue that
passing batches of 1000 documents in a single thread would still be faster
than spawning 1000 threads that all submit a single document. Am I wrong?
Maybe at small batch sizes they are pretty equal, but what about as the batch
size increases?
I guess I'm mainly focused on the HTTP interface and the overhead
associated with this type of messaging. Batching seems like a reasonable way
to reduce latency in this particular area but could very well create
bottlenecks elsewhere (e.g., index writing).
Even still, if multithreading is an option, then wouldn't sending batches
across each of those threads be more efficient than sending one document at
a time?
So assume I have 100 million documents of 3K each and I need to use HTTP.
I plan on using 20 threads per node with a 3-node feeding cluster (BTW, this
is, without a doubt, a common scenario in enterprise search deployments).
Being able to send a batch of a few hundred documents across each
connection is going to save me a lot of HTTP calls. No?
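The arithmetic for that scenario is straightforward (batch size of 200 is just an example figure):

```python
# Back-of-the-envelope count of HTTP calls for the scenario above:
# 100 million documents, fed by 3 nodes with 20 threads each.
docs = 100_000_000
threads = 3 * 20
batch_size = 200  # "a few hundred documents" per connection

one_at_a_time = docs               # one HTTP call per document
batched = docs // batch_size       # one HTTP call per batch
per_thread = batched // threads    # batch calls each feeder thread makes

print(f"{one_at_a_time:,} calls vs {batched:,} calls "
      f"({one_at_a_time // batched}x fewer), "
      f"~{per_thread:,} batch calls per thread")
```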
I think the transaction semantics are reasonable. If I send a batch of 200
documents, I would expect the batch to fail or succeed as one unit; otherwise
it's much harder for me to resubmit. This is generally how some of the
commercial vendors do it.
Regards,
-Eric