- I would suppose this is far from optimal performance, though of course
performance depends on loads of factors. For indexing, the most important
one is using the Bulk API:
If you're not using it already, you need to specify bulk=True when you call
index() on your connection. What this does (as far as I understand) is put
your document into a buffer, which gets flushed to ES via the Bulk API
once bulk_size is reached. You specify bulk_size when creating your
connection; it is 400 by default.
You have to take care of what happens when documents sit in your "buffer"
for a long time, or when your script exits. For example, when you insert
1000 items with a bulk size of 400, you might find only 800 in
Elasticsearch. To avoid that, you might need to flush the bulk manually via
flush_bulk(forced=True). Alternatively, you can refresh the index via
refresh(), which also flushes your bulk. But that will take a lot more
time, and it's not really recommended, because ES automatically refreshes
your index each second.
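
To make the 1000-versus-800 example concrete, here is a toy sketch of a size-triggered buffer. It is not pyes code: the BulkBuffer class and its method names are made up for illustration, and "sending" is only simulated with a counter.

```python
class BulkBuffer:
    """Toy stand-in for a client-side bulk buffer (hypothetical, not pyes)."""

    def __init__(self, bulk_size=400):
        self.bulk_size = bulk_size
        self.pending = []   # documents waiting to be sent
        self.sent = 0       # documents "delivered" so far

    def index(self, doc):
        self.pending.append(doc)
        # The buffer is flushed automatically only once it is full.
        if len(self.pending) >= self.bulk_size:
            self.flush()

    def flush(self):
        # A real client would issue one Bulk API request here.
        self.sent += len(self.pending)
        self.pending = []


buf = BulkBuffer(bulk_size=400)
for i in range(1000):
    buf.index({"line": i})

sent_before_flush = buf.sent       # 800: two full batches of 400
left_in_buffer = len(buf.pending)  # 200 documents still waiting
buf.flush()                        # the manual flush at the end of the run
sent_after_flush = buf.sent        # 1000
```

With pyes, the final manual step would be the flush_bulk(forced=True) call mentioned above.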
However, if you insert loads of data, you might be better off disabling
automatic refresh in ES:
and do it manually from your script once indexing is done. Note that during
that time your documents won't be available for search. And you also might
want to turn automatic refresh back on again afterwards.
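
As a rough sketch of toggling automatic refresh, the snippet below builds (but does not send) requests to the index settings endpoint. It assumes the index.refresh_interval setting, where "-1" disables refresh and "1s" is the usual default; the host, the index name "logs", and the helper function name are hypothetical.

```python
import json
import urllib.request


def refresh_interval_request(base_url, index, interval):
    """Build (but do not send) a PUT request to the index settings endpoint."""
    body = json.dumps({"index": {"refresh_interval": interval}})
    return urllib.request.Request(
        url="%s/%s/_settings" % (base_url.rstrip("/"), index),
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Hypothetical host and index name.
disable = refresh_interval_request("http://localhost:9200", "logs", "-1")
restore = refresh_interval_request("http://localhost:9200", "logs", "1s")

# To actually apply a change against a running node:
#   urllib.request.urlopen(disable)
```

The idea is to send the "disable" request before a large import and the "restore" request once indexing is done.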
If you want a raw figure of indexing performance, I get 15K inserts/sec
when putting pretty standard syslog lines into ES using pyes on a
relatively high-end laptop (i7, 8GB RAM). In this case, the ES config is
pretty standard, but I'm using Thrift as the transport (yes, pyes supports
it; you just need to install the plugin in ES and specify the default port,
9500, in your connection settings), and I'm also using multithreading.
It depends on what your searches look like, but it's very fast. On the
same laptop I get sub-second query times when I search my logs for
the newest 100 lines, at index sizes of up to 30M documents or so. And I
don't have SSD storage.
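
For reference, a "newest 100 lines" search could be expressed with a request body along these lines. This is only a sketch: the "@timestamp" field name is an assumption about how the log documents are mapped, and the helper function is hypothetical.

```python
import json


def newest_lines_query(timestamp_field="@timestamp", size=100):
    """Build a search body that returns the most recent documents first."""
    return {
        "query": {"match_all": {}},
        # Sort newest first on the (assumed) timestamp field.
        "sort": [{timestamp_field: {"order": "desc"}}],
        "size": size,
    }


query = newest_lines_query()
body = json.dumps(query)  # JSON body to send to the index's _search endpoint
```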
I don't know how pyes handles this, but if you're worried about
connection overhead, I think you should be looking at Thrift.
I haven't noticed a performance penalty. But if you want a more direct
client for Python, may I suggest mine
It's useful if you just want to look in the ES docs and apply the things
you see there directly in your Python app.
It's at a pretty early stage, but I haven't found issues so far. I'd be
glad to hear your feedback.
On Friday, August 10, 2012 10:41:00 PM UTC+3, Abhishek Pratap wrote:
I have started using ES as of today, and being a naive user I am wondering
what kind of performance I can expect from ES without doing any specific
tweaks to the defaults. I also have a few other questions.
On a small index of a few hundred lines of text I am seeing very good
performance. Right now I am inserting about 100K records into an ES index
using pyes, and I am seeing about 1000 lines of text inserted in 3-4
seconds. Is this close to the optimal performance? The server running ES is
not busy either.
I would assume I'd get orders of magnitude faster performance when I
do the searches.
Also, about pyes: once the index is created, how can I query it directly
without creating it each time? I don't see how to create a connection
handle to an existing ES index in pyes.
Is there any performance diff if I use pyes compared to direct server