Poll: Indexing speed?


(ppearcy) #1

Hi All,
Was curious what kind of indexing speed people are getting? I know
there are a lot of variables here, but the data should still be useful
to derive some trends.

Here is where I am at.

Speed: 5000 docs/minute
DocSize: 1K-4K
Cluster config: Single server w/ 48 cores and 48GB of RAM
Indexing config: Using pyelasticsearch with 18 subprocesses. Using
subprocess module to get around CPython's GIL issues
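The multi-process fan-out Paul describes can be sketched with the stdlib multiprocessing module. This is a minimal illustration, not Paul's actual code: index_batch is a placeholder for the real pyelasticsearch call, and the worker/batch counts are arbitrary.

```python
# Sketch of fanning indexing out over multiple processes to sidestep
# CPython's GIL. Paul used pyelasticsearch with 18 subprocesses; this
# illustrative version uses the stdlib multiprocessing module and a
# placeholder index_batch() -- a real worker would POST docs to ES.
from multiprocessing import Pool

def index_batch(batch):
    # Placeholder: a real worker would send each doc to Elasticsearch
    # (e.g. via pyelasticsearch). Here we just report the batch size.
    return len(batch)

def parallel_index(docs, workers=4, batch_size=100):
    # Split the docs into batches and index them in parallel processes.
    batches = [docs[i:i + batch_size] for i in range(0, len(docs), batch_size)]
    with Pool(processes=workers) as pool:
        indexed = pool.map(index_batch, batches)
    return sum(indexed)

if __name__ == "__main__":
    docs = [{"id": i, "body": "..."} for i in range(1000)]
    print(parallel_index(docs, workers=4))  # 1000
```

Each process has its own interpreter, so CPU-bound preprocessing (parsing, JSON conversion) scales across cores in a way threads cannot under the GIL.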

I'm very happy with these numbers, especially knowing that using the
Java API could probably yield further speed improvements. Also, I feel
as if my indexing process is the current bottleneck, not ES.

Thanks,
Paul


(Lukáš Vlček) #2

Hi Paul,

I think you have big room for improvement. Seriously.

I can tell you that I usually index around 30,000 docs in 30-40 seconds. That
time also includes converting the documents from raw format into JSON. The size
of the documents in raw format is from 2K to 250K (median probably around 10K),
but I do not know how much the size changes after converting to JSON. I use the
Java client, and both the Java client and the data node share the same 2-core
machine with 4GB of memory (but the Java processes are given a lot less than
4GB). I use 10 parallel threads for indexing, but I did not spend any time on
optimization, so I think these numbers can still be improved.
My time is pure indexing time; that is, I do not count the time the Java client
needs to connect to the cluster.
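For comparison, the two reported rates work out roughly as follows (simple arithmetic on the numbers above):

```python
# Back-of-the-envelope comparison of the two reported indexing rates.
paul_rate = 5000 / 60      # Paul: 5000 docs/minute, in docs/sec
lukas_low = 30000 / 40     # Lukas: 30,000 docs in 40 s (slow end)
lukas_high = 30000 / 30    # Lukas: 30,000 docs in 30 s (fast end)

print(round(paul_rate, 1))                   # 83.3 docs/sec
print(round(lukas_low), round(lukas_high))   # 750 1000
```

So Lukas's setup is indexing roughly 9-12x faster per second, on far less hardware, which is what makes the "big room for improvement" claim plausible.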

Regards,
Lukas

On Wed, Aug 18, 2010 at 7:37 PM, Paul ppearcy@gmail.com wrote:



(Diptamay) #3

Hi Lukas

Could you please provide a sample document that you index at such high
speeds? How complex and nested is this JSON document? Do you use
dynamic mapping, or have you specified the mapping beforehand?

My numbers are more like Paul's. However, I use the REST API
(which is slower), as I don't want my code to depend on ES's jars.
Also, my documents are pretty nested and complex (4-5 levels deep
in some cases), and since our content is pretty dynamic, I am using
dynamic mapping for now.

-Diptamay

On Aug 18, 2:32 pm, Lukáš Vlček lukas.vl...@gmail.com wrote:



(Shay Banon) #4

Hi,

Performance numbers depend on so many factors. Let me try and cover some
of them:

  1. When using the HTTP API, it should not be much slower than the Java API
    with a good client-side HTTP client. In the end, there isn't much difference
    between the two. The main points to check are that keep-alive is used, that
    the same client is reused, and that the client can parse HTTP fast. Sadly
    (and very much to my surprise, to be honest) this does not seem to be the
    case in many languages. My current plan of attack here is to provide a
    REST-based API on top of Thrift, so converting your current
    language-specific client to use it will be simple. I am currently leaning
    towards Thrift since it also provides the transport layer, not just the
    common serialization format, but I will probably do a similar thing with
    Protocol Buffers and a custom TCP-based server on the elasticsearch side.
    In the end the message will be simple and follow REST (method, URI,
    parameters, headers, body, ...).
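The keep-alive and client-reuse advice can be sketched in Python with the stdlib http.client, which keeps one TCP connection open across requests. The FakeES handler below is a stand-in server so the example is self-contained; the /index/type/id path merely mimics the ES REST layout and nothing ES-specific is assumed.

```python
# Sketch of the "reuse one keep-alive connection" advice: a single
# http.client.HTTPConnection is reused for every request, so the TCP
# handshake cost is paid once instead of per document.
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class FakeES(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"       # HTTP/1.1 enables keep-alive
    def do_PUT(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        resp = json.dumps({"ok": True, "path": self.path}).encode()
        self.send_response(200)
        self.send_header("Content-Length", str(len(resp)))
        self.end_headers()
        self.wfile.write(resp)
    def log_message(self, *args):       # keep the example output quiet
        pass

def index_docs(host, port, docs):
    conn = http.client.HTTPConnection(host, port)   # one reused connection
    results = []
    for i, doc in enumerate(docs):
        conn.request("PUT", "/test/doc/%d" % i, json.dumps(doc),
                     {"Content-Type": "application/json"})
        results.append(json.loads(conn.getresponse().read()))
    conn.close()
    return results

server = HTTPServer(("127.0.0.1", 0), FakeES)
threading.Thread(target=server.serve_forever, daemon=True).start()
out = index_docs("127.0.0.1", server.server_address[1],
                 [{"n": i} for i in range(3)])
print(len(out), out[0]["ok"])   # 3 True
server.shutdown()
```

Opening a fresh connection per document, by contrast, adds a connect/teardown round trip to every index request, which is often where slow clients lose their throughput.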

  2. Usually, from what I have seen, the clients get maxed out on resources
    before the server does. So, whenever you conduct a performance test, make
    sure you monitor the clients to check that they aren't under so much load
    that they can't push enough data.

As an example, in the elasticsearch codebase there is a jmeter script to
simulate simple load. With a very, very simple document, I can get around
6000 docs per second indexing on my MacBook Pro (latest gen, yummy...) with
20 concurrent clients, with jmeter taking most of the CPU. Of course, this
is using loopback.

  3. The other side of the equation is the elasticsearch nodes. The first and
    foremost thing to check is whether you have allocated enough memory to
    them. In Java land, if there isn't enough memory, long garbage collection
    cycles can kick in and start wreaking havoc on performance. Paul, in your
    case, I believe you create many indices, and you have a very beefy
    machine. Have you changed the default 1g max memory allocation
    elasticsearch comes with?

There are many more aspects to it, but in general, elasticsearch comes with
pretty good defaults out of the box (and automatically makes use of more
memory if allocated to it). The above are the first things I would check
when doing performance tests. If performance is still not satisfactory,
please share the data on the mailing list and I am here to help. There are
many other parameters that can be tweaked to increase performance, but it
is really trial and error here.

I do hope to write a section about it in the docs, so people will have
something to start with. I think that the most important thing to get
implemented, though, is a good web monitoring application for elasticsearch,
as most of the information is already exposed through the different APIs
(except for proper index-level statistics, which I hope to add sometime
soon).

-shay.banon

On Wed, Aug 18, 2010 at 11:22 PM, diptamay diptamay@gmail.com wrote:



(Lukáš Vlček) #5

Hi,

I am indexing emails. You can find tons of them on the net, for example
here: http://lucene.apache.org/mail/java-user/. Download some archive,
extract it, break it into individual emails (each email into one file), and
then index them in 10 parallel threads. No big magic so far. I have a fixed
mapping, but it is updated during indexing as well (I just put into the
mapping what differs from the defaults); the structure is relatively simple
and flat: some metadata (author, date, message-id, etc.) plus subject and
body.

I checked my data and indexing code again now, and I realized that the
numbers I gave you in my previous mail apply to 10,000 documents or so
(sorry for that, I had cut down my testing data set and forgot about it). So
I executed the test again with the full data set of ~30,000 documents
(29,253 to be more specific) and it took around 2 minutes. The data in my
data set is not homogeneous, so chances are that the 10,000 documents in my
reduced set are "lighter" and thus make it in ~35 sec, while the rest
proportionally need a little more time. According to the process monitor,
both cores were running at 99% utilization during that time (my computer
became almost unresponsive).

But these numbers are of little use as a serious benchmark, as I do not know
how much time was spent on conversion from mbox into JSON. And again, I did
not spend any time on memory settings or on tuning the number of threads or
the number of shards per index, etc.

Regards,
Lukas

On Wed, Aug 18, 2010 at 10:22 PM, diptamay diptamay@gmail.com wrote:



(Franz Allan Valencia See) #6

Curious, how long did your Java client spend converting to JSON, and how long
did it spend sending to ES for indexing (and getting a reply, if done
synchronously)? And what kind of objects are you converting to JSON?

From my experience, the conversion to JSON is the main bottleneck, not
ES's indexing (i.e., retrieval from the data source (even if it's just a
key-value pair db), conversion of complex objects like Joda DateTime). I'm
curious whether mine is an isolated case or if it's the same for you.
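One quick way to test this hypothesis is to time the JSON conversion in isolation. A minimal sketch, with Python's datetime standing in for Joda DateTime and a made-up document shape:

```python
# Time JSON serialization alone to see whether it is the bottleneck.
# datetime stands in for "complex objects like Joda DateTime": json
# can't serialize it without a default= hook, and that hook has a cost.
import json
import timeit
from datetime import datetime

doc = {"author": "a@example.com",
       "sent": datetime(2010, 8, 18, 19, 37),
       "subject": "Poll: Indexing speed?",
       "body": "x" * 2000}

def to_json(d):
    # str() is the simplest datetime encoder; isoformat() is another option
    return json.dumps(d, default=str)

n = 1000
secs = timeit.timeit(lambda: to_json(doc), number=n)
print("docs/sec (JSON conversion only): %d" % (n / secs))
```

If the conversion-only rate is close to your end-to-end indexing rate, the serializer (or the object model feeding it) is the bottleneck, not ES.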

Cheers,

Franz Allan Valencia See | Java Software Engineer
franz.see@gmail.com
LinkedIn: http://www.linkedin.com/in/franzsee
Twitter: http://www.twitter.com/franz_see

On Thu, Aug 19, 2010 at 2:32 AM, Lukáš Vlček lukas.vlcek@gmail.com wrote:



(Lukáš Vlček) #7

As I said, I am converting emails (using mime4j), and I did not perform any
further measurements.

On Thu, Aug 19, 2010 at 5:30 AM, Franz Allan Valencia See <
franz.see@gmail.com> wrote:



(Shay Banon) #8

Converting to and from JSON should be super fast (in the end, elasticsearch
parses and generates JSON as well). If you want, we can open another thread
on how to get it fast. In that thread, mention what library you use for JSON
conversion.

-shay.banon

On Thu, Aug 19, 2010 at 6:30 AM, Franz Allan Valencia See <
franz.see@gmail.com> wrote:



(Clinton Gormley) #9

We've got two ES servers, one has 16 cores and 12GB of RAM (ES_MAX_MEM =
6GB), the other only two cores and 8GB of RAM (ES_MAX_MEM=3GB).

I find that the smaller machine falls over under heavy indexing, so I
tend to turn it off while indexing many docs.

I'm using REST over HTTP with the Perl API ElasticSearch.pm, which is
synchronous.

We have about 5 million docs in one index and 3 million in a second
index. Docs are about 1kB on average, with about 25 fields (one or two
nested). I set up my mapping before starting.
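Setting up the mapping before indexing, rather than relying on dynamic mapping, can be sketched as follows. The field names here are hypothetical, and the named-type layout matches the pre-1.x elasticsearch of this thread's era; the payload is what you would PUT to the index before bulk indexing starts.

```python
# Sketch of a predefined mapping payload (hypothetical fields). Defining
# types up front avoids per-document dynamic-mapping updates during a
# heavy bulk load.
import json

mapping = {
    "doc": {                       # mapping type (pre-1.x ES used named types)
        "properties": {
            "title":   {"type": "string"},
            "created": {"type": "date"},
            "author":  {"type": "object",
                        "properties": {"name":  {"type": "string"},
                                       "email": {"type": "string"}}},
        }
    }
}

payload = json.dumps(mapping)
print(len(json.loads(payload)["doc"]["properties"]))  # 3 top-level fields
```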

Running 5-10 processes, I start out indexing about 600 docs per
second, but after a few million docs have been indexed, this drops to
about 200-300 per second.

My client maxes out on CPU, and the load on the ES server varies between
< 1 and 15-20 (the high end, I assume, is when it is flushing the new docs).
The ES server is also serving a live site during indexing.

ElasticSearch.pm uses LWP (a standard HTTP client in Perl), which is
complete and correct, but slow. I tried with a lighter HTTP client and
got a 30% increase, and with a Memcached backend and, I forget the exact
numbers, but a HUGE increase. This was, however, on just a few thousand
docs, so not really representative.

I'm working on an async version of ElasticSearch.pm, and I want to get
the Memcached backend properly supported, so it should be interesting to
see what difference this makes.

clint

--
Web Announcements Limited is a company registered in England and Wales,
with company number 05608868, with registered address at 10 Arvon Road,
London, N5 1PR.


(ppearcy) #10

Thanks for all the replies!

Yeah, I figure I can get another 3x boost in indexing speed, but I have a
backend document management system that I am pulling from that I don't
want to overwhelm. It's pretty cool that the indexing is the lightweight
part of the pipeline :)

On Aug 19, 4:02 am, Clinton Gormley clin...@iannounce.co.uk wrote:



(ppearcy) #11

FYI, by spreading my index preprocessing across more servers/processors,
I was able to get ~20,000 docs per minute. The load on the ES server
still looked negligible :)

I think that with optimized preprocessing and a move to the Java APIs,
60k/minute is very realistic. I'd be afraid any more than that could
bring down my backend file storage.

Very impressive.

On Aug 19, 9:55 am, Paul ppea...@gmail.com wrote:


