Indexing (insert) performance and tuning

I am currently testing ES as a replacement for MongoDB in a custom
centralized logging mechanism. Using Mongo, I am able to push entries
into the current instance at a rate of 500-800/second on average, with
peaks of 1,200/second. These are one-line log entries, broken down into
JSON objects by a Python program and inserted into a dedicated Mongo
collection for each remote logging host. All of that said, I haven't
been able to squeeze much more performance out of MongoDB, aside from
throwing more hardware behind it (which is slightly frowned upon at the
moment).

TL;DR
Basically, it seems as though ES will fit our purposes more closely
(especially in search performance). That being said, I have set up an
ES instance on the same hardware (MongoDB is shut down for testing),
and while the search performance seems great for what I've been able
to insert so far, the actual insert/indexing performance is nowhere
near adequate: I'm currently only able to insert around 25 entries per
second, obviously nowhere near the performance of MongoDB.
I haven't been able to find much good information on tuning insert
performance in ES, so if anyone could point me to some, that'd be
awesome. As for my current setup: I'm running 0.20.2 (installed with
the typical extract-and-splat method on CentOS 6) and using the pyes
Python library to interface with ES. The inserting program runs
locally on the same box, a VM with four cores @ 2.67GHz and 4GB RAM.
I'm not hitting any sort of disk limitation yet the way I have been
with MongoDB (the back end is hosted on the company SAN, which also
holds much of our production environment).

Thanks in advance for any help anyone might be able to offer me.

--

Have you tried the bulk indexing API?
http://www.elasticsearch.org/guide/reference/api/bulk.html

I'm not entirely familiar with PyES, but I think it implements the bulk
API too: http://davedash.com/2011/02/25/bulk-load-elasticsearch-using-pyes/
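
A minimal sketch of what that might look like with pyes (the bulk_size
kwarg, the bulk=True flag on index(), and force_bulk() are assumed from
that post, and the index/type names here are made up; check your pyes
version's docs):

# Sketch: queue docs client-side and flush them in batches,
# instead of issuing one HTTP request per log entry.
from pyes import ES

conn = ES('127.0.0.1:9200', bulk_size=1000)  # flush every 1,000 docs

for line in open('app.log'):
    doc = {'message': line.rstrip('\n')}  # your real parser goes here
    conn.index(doc, 'logs', 'entry', bulk=True)  # queued, not sent yet

conn.force_bulk()  # push whatever is left in the queue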

Also try some of the tips Shay recommends here:
https://groups.google.com/d/msg/elasticsearch/APWxRLrMOeU/HxZEyY0Yx_sJ

-Zach

--

Thanks for the link to the tips. I've seen the bulk API, and it's
definitely a possibility; it'll just require some re-engineering of our
insertion mechanism beyond the conversion from Mongo to ES, so I was
half hoping (and half being lazy in hoping) that a more drastic change
like that could be avoided. That being said, if the bulk performance
really is that much better, the changes can obviously be made.

--

Having never used the bulk API myself, I have no idea what the
performance difference is. Before rewriting your code, I'd try the
configuration tweaks first (turning off replication, changing refresh
rates, etc.).
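
For example, both of those are dynamic settings you can change through
the index settings endpoint. A rough sketch (the 'logs' index name is
made up, and I'm assuming the Python requests library here):

# Sketch: trade freshness/redundancy for indexing speed on ES 0.20.x.
import json
import requests

settings = {'index': {
    'number_of_replicas': 0,    # turn off replication while loading
    'refresh_interval': '-1',   # disable the automatic index refresh
}}
resp = requests.put('http://localhost:9200/logs/_settings',
                    data=json.dumps(settings))
print(resp.text)

# When the load finishes, set refresh_interval back (e.g. '1s')
# and raise number_of_replicas again.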

Good luck!
-Z

On Monday, January 14, 2013 1:44:08 PM UTC-5, Omega Mike wrote:

Thanks for the link to the tips. I've seen the bulk API, which is
definitely a possibility to use, it'll just require some re-engineering of
the insertion mechanism we have, beyond the conversion from Mongo to ES, so
I was half hoping/half being lazy in hoping that a more drastic change like
that could be avoided. That being said, if the bulk performance is really
that great, the changes can obviously be made.

On Monday, January 14, 2013 11:04:40 AM UTC-6, Zachary Tong wrote:

Have you tried the bulk indexing API?
Elasticsearch Platform — Find real-time answers at scale | Elastic

I'm not entirely familiar with PyES, but I think it implements the bulk
API too:
Bulk load ElasticSearch using pyes | dave dash

Also try some of the tips Shay recommends here:
https://groups.google.com/d/msg/elasticsearch/APWxRLrMOeU/HxZEyY0Yx_sJ

-Zach

On Monday, January 14, 2013 10:54:57 AM UTC-5, Omega Mike wrote:

I am currently testing ES as a replacement for MongoDB in a custom
centralized logging mechanism. Using Mongo, I am able to throughput entries
into the current instance at a rate of 500-800/second on average with peaks
of 1200/second. These are one-line log entries, broken down into JSON
objects by a Python program and inserted into a dedicated Mongo collection
for each remote logging host. All of this being said, I haven't been able
to squeeze much more performance out of MongoDB, aside from throwing more
hardware behind it (which is slightly frowned upon at the moment).

TL;DR
Basically, it seems as though ES will fit our purposes
more closely (especially in search performance). That being said, I have
setup an ES instance on the same hardware (MongoDB is shutdown for testing)
and while the search performance seems great, for what I've been able to
insert so far, the actual inserting or indexing performance is nowhere near
adequate. I'm currently only able to insert around 25 entries per second,
obviously nowhere near the performance of MongoDB.
I haven't been able to find any great information on tuning the
performance of inserts in ES at all, so if anyone could point me to those
that'd be awesome. Otherwise, as for my current setup, I'm using 0.20.2
(installed with the typical extract and splat method on CentOS 6), I'm
using the pyes Python library to interface with ES, the program inserting
is running locally on the same box, which is a VM with four cores @ 2.67GHz
and 4GB RAM. I'm not hitting any sort of disk limitation yet (which on the
back-end is hosted on the company SAN which has much of our production
environment) they way I have been with MongoDB.

Thanks in advance for any help anyone might be able to offer me.

--

It's very different in terms of performance, especially if you have
multiple shards (5 by default).
So, use it if you can.

You can try a simple script to test the performance: insert 100,000
docs one at a time in a loop, then insert them again using bulk
requests of 10,000 docs each. I bet you will see the gain.
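
Something along these lines (a rough sketch with pyes; bulk_size,
bulk=True, and force_bulk() are assumed from the post Zach linked, and
the index/type names are made up):

# Rough benchmark: one-request-per-doc vs. batched bulk inserts.
import time
from pyes import ES

docs = [{'n': i, 'message': 'log line %d' % i} for i in range(100000)]

# One HTTP round trip per document.
conn = ES('127.0.0.1:9200')
start = time.time()
for doc in docs:
    conn.index(doc, 'bench-single', 'entry')
print('single: %.1fs' % (time.time() - start))

# Queued client-side, flushed every 10,000 docs.
conn = ES('127.0.0.1:9200', bulk_size=10000)
start = time.time()
for doc in docs:
    conn.index(doc, 'bench-bulk', 'entry', bulk=True)
conn.force_bulk()  # flush the last partial batch
print('bulk:   %.1fs' % (time.time() - start))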

David


--
David Pilato
http://www.scrutmydocs.org/
http://dev.david.pilato.fr/
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

--

Indeed, using the bulk system makes a night-and-day difference.
Definitely worth making the minor changes to the loader logic. Thanks
for the help.

-Michael

--