Indeed, using the bulk system makes a night and day difference. Definitely
worth making the minor changes to logic in the loader. Thanks for the help.
It makes a big difference in terms of performance, especially if you have
multiple shards (5 by default).
So, use it if you can.
You can try a simple script to test performance. Insert 100,000 docs in a
loop, first one request per doc and then with bulk requests of 10,000 docs
each. I bet you will see the gain.
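For reference, here is a rough sketch of the batching side of such a test. It only builds the request bodies and does not talk to a cluster; each body would be POSTed to the `_bulk` endpoint described in the bulk API docs. The index name "logs", the type "entry", and the document fields are made up for illustration, and `chunked` / `bulk_body` are hypothetical helpers, not part of pyes.

```python
import json
from itertools import islice


def chunked(docs, size):
    """Yield successive lists of at most `size` documents."""
    it = iter(docs)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def bulk_body(batch, index, doc_type):
    """Build one bulk request body: an action line plus a source line per doc."""
    lines = []
    for doc in batch:
        # Action line: index this doc into the given index/type (names are assumptions).
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        # Source line: the document itself.
        lines.append(json.dumps(doc))
    # The bulk API expects newline-delimited JSON terminated by a newline.
    return "\n".join(lines) + "\n"


# Hypothetical log documents, 100,000 of them, batched 10,000 at a time.
docs = ({"host": "web01", "seq": i} for i in range(100000))
bodies = [bulk_body(b, "logs", "entry") for b in chunked(docs, 10000)]
print(len(bodies))  # 10 requests instead of 100000
```

Timing the ten bulk requests against 100,000 single-document index calls on the same box should make the difference easy to see.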
wrote:
Having never used the bulk api, I have no idea what the performance
difference is. Before rewriting your code, I'd try the configuration
tweaks first (turning off replication, changing refresh rates, etc).
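As a sketch of what those two tweaks might look like, the snippet below builds settings bodies that could be sent with a PUT to an index's `_settings` endpoint. It assumes the `index.number_of_replicas` and `index.refresh_interval` settings, where a refresh interval of "-1" disables periodic refresh; the function names are hypothetical, and the "restore" values are simply the usual defaults, so adjust them to whatever the index normally uses.

```python
import json


def bulk_load_settings():
    """Settings to apply before a heavy load: no replicas, no periodic refresh."""
    return {"index": {"number_of_replicas": 0, "refresh_interval": "-1"}}


def normal_settings(replicas=1, refresh="1s"):
    """Settings to restore afterwards (defaults here are an assumption)."""
    return {"index": {"number_of_replicas": replicas, "refresh_interval": refresh}}


# Print the JSON bodies; sending them to the cluster is left out on purpose.
print(json.dumps(bulk_load_settings()))
print(json.dumps(normal_settings()))
```

Remember to restore the normal settings once the load finishes, otherwise new documents will not become searchable on their own and the index will have no replicas.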
On Monday, January 14, 2013 1:44:08 PM UTC-5, Omega Mike wrote:
Thanks for the link to the tips. I've seen the bulk API, and it's
definitely an option, but it would require some re-engineering of the
insertion mechanism we have, beyond the conversion from Mongo to ES. I was
half hoping (and half being lazy) that a more drastic change like that
could be avoided. That being said, if the bulk performance is really that
great, the changes can obviously be made.
On Monday, January 14, 2013 11:04:40 AM UTC-6, Zachary Tong wrote:
Have you tried the bulk indexing API?
I'm not entirely familiar with PyES, but I think it implements the bulk
API too: http://www.elasticsearch.org/guide/reference/api/bulk.html
Also try some of the tips Shay recommends here: http://davedash.com/2011/02/25/bulk-load-elasticsearch-using-pyes/
On Monday, January 14, 2013 10:54:57 AM UTC-5, Omega Mike wrote:
I am currently testing ES as a replacement for MongoDB in a custom
centralized logging mechanism. Using Mongo, I am able to push entries
into the current instance at a rate of 500-800/second on average, with peaks
of 1200/second. These are one-line log entries, broken down into JSON
objects by a Python program and inserted into a dedicated Mongo collection
for each remote logging host. All of this being said, I haven't been able
to squeeze much more performance out of MongoDB, aside from throwing more
hardware behind it (which is slightly frowned upon at the moment).
Basically, it seems as though ES will fit our purposes
more closely (especially in search performance). That being said, I have
set up an ES instance on the same hardware (MongoDB is shut down for testing)
and while the search performance seems great for what I've been able to
insert so far, the actual inserting or indexing performance is nowhere near
adequate. I'm currently only able to insert around 25 entries per second,
obviously nowhere near the performance of MongoDB.
I haven't been able to find any great information on tuning the
performance of inserts in ES at all, so if anyone could point me to some
that'd be awesome. Otherwise, as for my current setup: I'm using 0.20.2
(installed with the typical extract and splat method on CentOS 6), and I'm
using the pyes Python library to interface with ES. The program doing the
inserts is running locally on the same box, which is a VM with four cores
@ 2.67GHz and 4GB RAM. I'm not hitting any sort of disk limitation yet (the
back-end storage is hosted on the company SAN, which has much of our
production environment) the way I have been with MongoDB.
Thanks in advance for any help anyone might be able to offer me.