Hi,
I have 7 indices, one for each day of logging, so I'm always
inserting into the latest index. Each index has 5 shards with one
replica. The last test ran on 2 nodes, each with 8 cores and 20GB of
RAM. The performance test went like this:
- I'm inserting from 3 threads, each sending 5000 log lines at a time
with bulk indexing
- after inserting a million logs, I check the query performance.
If 5 queries in a row each take more than 4 seconds (spikes happen, so
I wanted to rule those out), my "loading" script stops
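
The stop condition in the list above could be sketched roughly like
this. It is only an illustration of the "5 slow queries in a row"
rule, not the actual test script; the names SlowQueryGuard and timed
are made up for the example, and the query itself is left as any
zero-argument callable.

```python
import time

SLOW_SECONDS = 4.0      # a query slower than this counts as "slow"
MAX_SLOW_IN_A_ROW = 5   # stop after this many consecutive slow queries


class SlowQueryGuard:
    """Tracks consecutive slow queries so that single spikes are ignored."""

    def __init__(self, slow_seconds=SLOW_SECONDS, max_in_a_row=MAX_SLOW_IN_A_ROW):
        self.slow_seconds = slow_seconds
        self.max_in_a_row = max_in_a_row
        self.slow_streak = 0

    def record(self, elapsed_seconds):
        """Record one query duration; return True when loading should stop."""
        if elapsed_seconds > self.slow_seconds:
            self.slow_streak += 1
        else:
            self.slow_streak = 0  # one fast query resets the streak
        return self.slow_streak >= self.max_in_a_row


def timed(run_query):
    """Run a zero-argument callable and return its duration in seconds."""
    start = time.monotonic()
    run_query()
    return time.monotonic() - start
```

In a loader, something like `if guard.record(timed(my_query)): break`
after each batch of a million logs would give the behaviour described,
where `my_query` stands for whatever call runs the search.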
Up to 70M documents everything went brilliantly, so I left the test
running overnight, expecting to see where it stopped the next day.
Next day, the state was like this:
- 192M logs were indexed
- all inserts were timing out. This means it couldn't index 3x5000 log
lines in 30 seconds
- the test script was hung on a query, so I just stopped it
When I looked in the logs, I found that all queries had returned in
1 second or so, except the last one, which took 18 seconds. I then
queried for a string and got no result after about 10 minutes, at
which point I gave up.
Looking at the inserts, some of them started timing out once the test
passed 100M documents, but only rarely (1 out of 10-20 times).
The last thing I did was to restart both ES instances. After that, a
query for a string would return in a few seconds, but a query doing a
"match_all" sorted by date (to give me the last 50 logs) would still
run for more than 10 minutes, at which point I stopped it.
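
For clarity, the "last 50 logs" query described above would look
roughly like the request body below. This is a sketch, not the exact
query from the test: the field name "date" is an assumption, so
substitute whatever the timestamp field is called in the mapping.

```python
import json


def last_logs_query(date_field="date", size=50):
    """Build a search body: match everything, newest documents first.

    date_field is a placeholder name; it is not taken from the real mapping.
    """
    return {
        "query": {"match_all": {}},
        "sort": [{date_field: {"order": "desc"}}],
        "size": size,
    }


# Print the JSON that would be sent as the body of a _search request.
print(json.dumps(last_logs_query(), indent=2))
```

Sorting every matching document on a field is a different kind of
work from looking up a single string, which may be part of why the
two queries behaved so differently after the restart.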
Memory was of course full of cache, but CPU and disk activity were
surprisingly low (let's say under 10% of capacity). It all looked
like a hang to me.
My conclusion was that up until 190M logs, inserts were slowly getting
worse, and queries were very fast. But after that everything became
unusable. At least the queries did; I haven't tried inserting fewer
documents.
I think that's about all the relevant information I can think of. If
you need more info, I can rerun the test next week and provide some
more. I'm running on 0.18.7, but I can retry with 0.19.0 if you think
it makes a difference.
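
For reference, the shard layout described at the top of this message
can be tallied as below. This is just arithmetic on the numbers given
(7 indices, 5 shards, 1 replica, 2 nodes), and it assumes all 7
indices exist at once and that shard copies are spread evenly across
the nodes.

```python
def shard_copies_per_node(indices, shards_per_index, replicas, nodes):
    """Return (total shard copies, average copies per node)."""
    # Each primary shard has `replicas` extra copies.
    total_copies = indices * shards_per_index * (1 + replicas)
    return total_copies, total_copies / nodes


total, per_node = shard_copies_per_node(
    indices=7, shards_per_index=5, replicas=1, nodes=2
)
print(total, per_node)  # 70 shard copies, 35.0 per node on average
```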
Thanks!
Radu
On Mar 1, 1:49 pm, Shay Banon kim...@gmail.com wrote:
To try and help, I need more info, especially the number of indices / shards you have and the number of nodes you are running. Also, what do you mean by abrupt? I did not understand the thin line between 1 second and a timeout. What is the timeout?
On Wednesday, February 29, 2012 at 5:01 PM, Radu Gheorghe wrote:
Hi,
After inserting a lot of documents in elasticsearch (when the index
size gets to about 4-5 times the RAM size), the performance on both
inserting and querying drops abruptly. By "abruptly", I mean there is
a thin line between returning a query in 1 second, and getting a
timeout from the browser.
It seems pretty obvious that the RAM is not enough for something (the
inverted index?), because the slower the storage, the more abrupt the
performance drop is.
I have two questions:
- what is the specific reason that makes ES go suddenly from super-
fast to super-slow?
- is there something I can do to make the performance drop less abrupt?
The point is that I'm having trouble estimating what hardware I need
when I have to account for spikes. It's kind of hard to find a balance
between budgeting 4 servers and praying it will work, and budgeting 10
servers "just in case".
Thanks in advance!
Radu
P.S. Some information about my setup:
- indexing log lines of about 1K each
- default "local" gateway
- default 1s refresh rate
- using the Python "pyes" over HTTP for both insert and query
- 1 replica per shard
- a first look at the logs didn't help
But I can reproduce the problem and provide any potentially useful
info if it's needed.
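
As a very rough aid for the hardware estimate mentioned above, the
ratio of data to RAM can be computed like this. It is a back-of-the-
envelope sketch only: it uses raw log size (about 1K per line, 1
replica, as in the P.S.) as a stand-in for index size, which is an
assumption, since the on-disk index can be larger or smaller than the
raw data. The document count, node count and RAM in the example call
are hypothetical.

```python
def raw_data_to_ram_ratio(doc_count, avg_doc_bytes, replicas, nodes, ram_gb_per_node):
    """Return raw data size (all copies) divided by total cluster RAM."""
    copies = 1 + replicas                      # primary plus replicas
    raw_bytes = doc_count * avg_doc_bytes * copies
    total_ram_bytes = nodes * ram_gb_per_node * 1024 ** 3
    return raw_bytes / total_ram_bytes


# Hypothetical example: 100M log lines of ~1K each, 1 replica,
# on 4 nodes with 16GB of RAM each.
ratio = raw_data_to_ram_ratio(
    doc_count=100_000_000,
    avg_doc_bytes=1024,
    replicas=1,
    nodes=4,
    ram_gb_per_node=16,
)
print("data is about %.1f times the cluster RAM" % ratio)
```

Tracking this ratio while loading would at least show how close a
given cluster is to the "4-5 times the RAM size" point where the drop
was observed.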