We've got two ES servers: one has 16 cores and 12GB of RAM
(ES_MAX_MEM=6GB), the other only two cores and 8GB of RAM (ES_MAX_MEM=3GB).
I find that the smaller machine falls over under heavy indexing, so I
tend to turn it off while indexing many docs.
I'm using REST over HTTP with the Perl API ElasticSearch.pm, which is
synchronous.
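
To make "synchronous REST over HTTP" concrete, here is a minimal sketch
in Python (not Perl, and not how ElasticSearch.pm is implemented) of
what indexing one doc looks like on the wire. The index name, type name,
doc id and field are made up for illustration; localhost:9200 is just
the default HTTP port. The request is only built here, not sent.

```python
import json
import urllib.parse
import urllib.request


def build_index_request(base_url, index, doc_type, doc_id, doc):
    """Build (but do not send) a PUT request that indexes a single doc."""
    # URL shape assumed here: /{index}/{type}/{id}
    path = "/".join(
        urllib.parse.quote(str(part), safe="")
        for part in (index, doc_type, doc_id)
    )
    body = json.dumps(doc).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/{path}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical names, for illustration only.
req = build_index_request(
    "http://localhost:9200", "myindex", "mytype", 1, {"title": "example"}
)
print(req.get_method(), req.full_url)

# A synchronous client would now call urllib.request.urlopen(req) and
# block until the server answers before sending the next doc.
```

With a blocking client like this, each process spends most of its time
waiting on the round trip, which is why I run several processes in
parallel.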
We have about 5 million docs in one index and 3 million in a second
index. Docs are about 1kB on average, with about 25 fields (one or two
nested). I set up my mapping before starting.
Running 5-10 processes, I start out indexing about 600 docs per
second, but after a few million docs have been indexed, this drops to
about 200-300 per second.
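
To put those rates in perspective, here is a rough back-of-the-envelope
calculation in Python. It assumes a constant rate for the whole run and
that both indexes (8 million docs in total) are rebuilt, neither of
which is exactly true in practice.

```python
def hours_to_index(total_docs, docs_per_second):
    """Hours needed to index total_docs at a constant rate."""
    return total_docs / docs_per_second / 3600


TOTAL_DOCS = 5_000_000 + 3_000_000  # both indexes, figures from above

for rate in (600, 300, 200):
    print(f"{rate} docs/s -> {hours_to_index(TOTAL_DOCS, rate):.1f} hours")

# 600 docs/s -> 3.7 hours
# 300 docs/s -> 7.4 hours
# 200 docs/s -> 11.1 hours
```

So the slowdown roughly doubles or triples the time for a full reindex.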
My client maxes out on CPU, and the load on the ES server varies between
under 1 and 15-20 (I assume the peaks are when it is flushing the new
docs). The ES server is also responding to a live site during indexing.
ElasticSearch.pm uses LWP (a standard HTTP client in Perl), which is
complete and correct, but slow. I tried a lighter HTTP client and got a
30% speed increase, and with a Memcached backend I got a HUGE increase
(I forget the exact numbers). This was, however, on just a few thousand
docs, so not really representative.
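
For a more representative comparison, the backends would need to be
timed over a long run, since the rate changes as the index grows. A
minimal sketch of such a harness in Python follows. The `send` callable
is a placeholder for whichever client or backend is being tested (it is
not part of ElasticSearch.pm), and the `clock` parameter exists only so
the function can be tested with a fake clock.

```python
import time


def measure_throughput(docs, send, window=1000, clock=time.perf_counter):
    """Send every doc; return docs/second for each full window of docs.

    A trailing partial window (fewer than `window` docs) is not reported.
    """
    rates = []
    start = clock()
    count = 0
    for doc in docs:
        send(doc)  # placeholder: one blocking index call per doc
        count += 1
        if count == window:
            now = clock()
            elapsed = now - start
            rates.append(count / elapsed if elapsed > 0 else float("inf"))
            start = now
            count = 0
    return rates


if __name__ == "__main__":
    # Dummy backend that does nothing, just to show the call shape.
    rates = measure_throughput(range(5000), send=lambda doc: None)
    print(len(rates), "windows measured")
```

Comparing the per-window rates from the start and the end of a run
would show whether a faster client still helps once the server itself
slows down.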
I'm working on an async version of ElasticSearch.pm, and I want to get
the Memcached backend properly supported, so it should be interesting to
see what difference this makes.
clint