Elasticsearch expected performance

Hi Guys

I started using ES today, and as a naive user I am wondering what kind of
performance I can expect from ES without any specific tweaks to the
defaults. I also have a few other questions.

On a small index of a few hundred lines of text I am seeing very good
performance. Right now I am inserting about 100K records into an ES index
using pyes, and I am seeing about 1000 lines of text inserted in 3-4
seconds. Is this close to the optimal performance? The server running ES is
not busy either.

  1. I would assume I'd get orders of magnitude faster performance when I do
    searches.

  2. Also about pyes: once the index is created, how can I query it directly
    without creating it each time? I don't see a way to create a connection
    handle to an existing ES index in pyes.

  3. Is there any performance difference if I use pyes compared to direct
    server queries?

-A

--

Guys

I am still wondering if anyone can give me some basic benchmarks on ES. I
am trying to see if I can do better than inserting 1000 records into an ES
index in 3-5 seconds. Each record in my case is 3 columns of text.

-Abhi

On Friday, August 10, 2012 12:41:00 PM UTC-7, Abhishek Pratap wrote:

--

Hi Abhishek,

  1. I would suppose this is far from optimal performance. But of course,
    performance depends on loads of factors. The most important one for
    indexing is using the Bulk API.

If you're not using it already, you need to specify bulk=True when you call
index() on your connection. What this does (as far as I understand) is put
your document into a buffer, which gets flushed to ES via the Bulk API once
bulk_size is reached. You specify bulk_size when creating your connection,
and it is 400 by default.

You have to take care of what happens when documents sit in your "buffer"
for a long time, or when your script is about to exit. For example, when you
insert 1000 items with a bulk size of 400, you might find only 800 in
Elasticsearch. For that you might need to flush the bulk manually via
flush_bulk(forced=True). Or you can refresh the index via refresh(), which
also flushes your bulk. But that will take a lot more time, and it's not
really recommended because ES automatically refreshes your index every
second by default.
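To make the 1000-vs-800 example concrete, here is a minimal sketch of how a client-side bulk buffer behaves. It is a toy stand-in, not the pyes implementation: the BulkBuffer class and its sent/pending attributes are made up for illustration, and no request is actually sent to ES.

```python
class BulkBuffer:
    """Toy stand-in for a client-side bulk buffer (hypothetical, not pyes)."""

    def __init__(self, bulk_size=400):
        self.bulk_size = bulk_size
        self.pending = []  # documents waiting for the next bulk request
        self.sent = 0      # documents already shipped to the server

    def index(self, doc):
        self.pending.append(doc)
        # Ship a batch only once the buffer is full.
        if len(self.pending) >= self.bulk_size:
            self.flush()

    def flush(self):
        # A real client would send one Bulk API request here.
        self.sent += len(self.pending)
        self.pending = []


buf = BulkBuffer(bulk_size=400)
for i in range(1000):
    buf.index({"line": i})

print(buf.sent, len(buf.pending))  # 800 sent, 200 still buffered
buf.flush()                        # the manual flush before exiting
print(buf.sent, len(buf.pending))  # 1000 sent, 0 buffered
```

The last 200 documents only reach the server because of the explicit flush at the end, which is the role flush_bulk(forced=True) plays in the description above.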

However, if you insert loads of data, you might be better off disabling
automatic refresh in ES and refreshing manually from your script once
indexing is done. Note that during that time your documents won't be
available for search. You also might want to turn automatic refresh back on
again afterwards.

If you want a raw figure for indexing performance, I get 15K inserts/sec
when putting pretty standard syslog lines into ES using pyes on a relatively
high-end laptop (i7, 8GB RAM). In this case, the ES config is pretty
standard, but I'm using Thrift as the transport (yes, pyes supports it; you
just need to install the plugin in ES and specify the default port, 9500, in
your connection settings), and I'm also using multithreading.

  1. It depends on what your searches look like, but it's very fast. On the
    same laptop I get sub-second query times when I search my logs for the
    newest 100 lines, at index sizes of up to 30M documents or so. And I
    don't have SSD storage.

  2. I don't know how pyes handles this, but if you're worried about
    connection overhead, I think you should be looking at Thrift.

  3. I haven't noticed a performance penalty. But if you want a more direct
    client for Python, may I suggest mine :)

https://github.com/radu-gheorghe/slimes

It's useful if you just want to look in the ES docs and apply the things
you see there directly in your Python app.

It's at a pretty early stage, but I haven't found issues so far. I'd be
glad to hear your feedback.

--

Thanks a lot Radu. I will try the tweaks based on your advice and see what
I get.

-Abhi

On Monday, August 13, 2012 10:43:01 PM UTC-7, Radu Gheorghe wrote:

--

Radu

I am wondering how I can turn indexing on/off through pyes. I think most
of my questions are related to the pyes documentation, but I understand
this is a work in progress.

-Abhi

On Tuesday, August 14, 2012 11:11:22 AM UTC-7, Abhishek Pratap wrote:

--

Forgot to mention: I did try passing refresh_interval=-1 when calling
conn.index(), but I get an unexpected keyword argument error.

-A

On Tuesday, August 14, 2012 3:52:58 PM UTC-7, Abhishek Pratap wrote:

--

I have no idea how/if you can change the refresh interval through pyes. I
was thinking about doing it from the command line if you only need to do it
one time.

Something like:

curl -XPUT 'localhost:9200/my_index/_settings' -d '{
    "index" : {
        "refresh_interval" : -1
    }
}'

You can skip the "my_index" part if you want to apply the setting to all
your indices.
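If you'd rather do the same thing from your Python script, here is a rough sketch using only the standard library. The refresh_settings_request helper is a made-up name for illustration, the "1s" value for restoring assumes the one-second default mentioned earlier, and the Content-Type header is an assumption too (the curl command above doesn't send it). The sketch only builds the requests; it sends nothing until you uncomment the urlopen lines.

```python
import json
import urllib.request


def refresh_settings_request(interval, index=None, host="http://localhost:9200"):
    """Build (but do not send) a PUT request that sets refresh_interval.

    Hypothetical helper; pass index=None to target all indices.
    """
    path = "/_settings" if index is None else "/%s/_settings" % index
    body = json.dumps({"index": {"refresh_interval": interval}}).encode("utf-8")
    return urllib.request.Request(
        host + path,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},  # assumption, see above
    )


# Disable automatic refresh before a large import, restore it afterwards.
disable = refresh_settings_request(-1, index="my_index")
restore = refresh_settings_request("1s", index="my_index")

# urllib.request.urlopen(disable)  # uncomment to send to a running node
# ... bulk indexing goes here ...
# urllib.request.urlopen(restore)
```

Keeping the request-building separate from sending makes it easy to check the URL and body before pointing it at a real cluster.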

On Wednesday, August 15, 2012 1:57:46 AM UTC+3, Abhishek Pratap wrote:

--