Bulk Insert Throughput Issues

I have a question regarding the throughput of _bulk indexing in
Elasticsearch, running version 0.90.3 on Ubuntu 13.04 64-bit. I've done a
great deal of searching and have been hard pressed to find anything that
appropriately covers our needs.

In our system we're looking to index data at very high volume, using an
admittedly odd parent/child relationship mapping
(https://gist.github.com/fidyeates/18cec7116926516bc033). Under our
current traffic we index around 1,500 'details' documents per second,
which equates to, on average, 12,000 documents per second that need to be
indexed. At roughly 2-3 KB per document, that gives us a write rate of
30-40 MB/second into Elasticsearch.
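For context, each call we make is just newline-delimited action/source
pairs POSTed to the _bulk endpoint. A minimal sketch of one such call in
Python using the requests library (the index and type names here are
placeholders, not our real mapping):

import json
import requests

ES_URL = "http://localhost:9200/_bulk"  # assumption: talking to a local node

def bulk_index(docs, index, doc_type):
    # Each document becomes two lines: an action line and a source line.
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps(doc))
    body = "\n".join(lines) + "\n"  # the _bulk body must end with a newline
    resp = requests.post(ES_URL, data=body)
    resp.raise_for_status()
    return resp.json()

# e.g. bulk_index([{"field": "value"}], "details-2013.09.01", "details")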

Our Elasticsearch deployment is currently two m1.large Amazon EC2
instances with default configuration, 10 shards and 0 replicas. We are
also rotating indices daily and deleting data that's older than two weeks.
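The rotation itself is simple: write to a date-stamped index and drop
anything past the retention window. Roughly what we do, sketched below
(the index naming convention and host are illustrative, not our exact
setup):

import datetime
import requests

ES = "http://localhost:9200"  # assumption: a reachable node

def todays_index(prefix="details"):
    # e.g. "details-2013.09.01" -- the naming scheme is an assumption
    return "%s-%s" % (prefix, datetime.date.today().strftime("%Y.%m.%d"))

def drop_old_indices(prefix="details", keep_days=14):
    # Attempt deletes over a window of older daily indices; a DELETE on
    # an index that doesn't exist just returns 404, which we ignore.
    for age in range(keep_days + 1, keep_days + 31):
        day = datetime.date.today() - datetime.timedelta(days=age)
        requests.delete("%s/%s-%s" % (ES, prefix, day.strftime("%Y.%m.%d")))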

However, this does seem to be the capacity of this Elasticsearch
deployment: we're waiting a long time on all the _bulk calls (tested with
bulk sizes from 500 to 10,000 documents), and we're currently feeding in
from three parallel threads with the calls evenly distributed across the
ES nodes.
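The feeding side looks roughly like the sketch below (node addresses and
the batch source are stand-ins for our real setup):

import itertools
import threading
import requests

NODES = ["http://es-node-1:9200", "http://es-node-2:9200"]  # hypothetical hosts

def worker(batches):
    node_cycle = itertools.cycle(NODES)  # per-thread round-robin over the nodes
    for body in batches:  # body: a pre-built newline-delimited _bulk payload
        requests.post(next(node_cycle) + "/_bulk", data=body)

def run(all_batches, num_threads=3):
    # Naive split of the batch list across the worker threads.
    chunks = [all_batches[i::num_threads] for i in range(num_threads)]
    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()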

1.) Are there any good performance tweaks we can make to the cluster to
increase the indexing throughput? I can understand if we have hit the
capacity of two m1.larges (2 cores, 4 ECUs, 7.5 GB RAM).

2.) This all disregards search performance. Since we will also want to
query this data, can we expect to disrupt _bulk performance significantly
by running queries on 'old' indices?

3.) Any tips, or a pointer in the right direction regarding documentation
on this specific area, would be much obliged (the Elasticsearch docs are
great, though!).

Cheers,

Fin


Fin Yeates wrote:

1.) Are there any good performance tweaks we can make to the
cluster to increase the indexing throughput? I can understand
if we have hit the capacity of two m1.larges (2 cores, 4 ECUs,
7.5 GB RAM).

30-40 MB/s is all I would expect from two m1.larges; they average
15-20 MB/s apiece. You could try indexing from multiple threads
across multiple nodes and see if you can increase the overall
rate, but I would expect those two nodes are near network capacity.

2.) This all disregards search performance. Since we will also
want to query this data, can we expect to disrupt _bulk
performance significantly by running queries on 'old' indices?

That depends on what kind of data you want to pull out. I suspect
you're not really taxing the CPU and, depending on what storage
you use, may not even be taxing the disk.

3.) Any tips, or a pointer in the right direction regarding
documentation on this specific area, would be much obliged
(the Elasticsearch docs are great, though!).

In general, ES is not the bottleneck when indexing. If you can
parallelize your I/O and streamline whatever process feeds
ES (test the rate you can send to /dev/null, for example), you can
scale very well.
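As a rough sketch of that /dev/null test, measuring how fast your
producer runs with ES taken out of the loop entirely (make_document()
here is a stand-in for whatever generates your 'details' docs):

import json
import time

def make_document(i):
    # Stand-in generator; ~2.5 KB payload to match the average size quoted.
    return {"id": i, "payload": "x" * 2500}

def benchmark(n=100000):
    sink = open("/dev/null", "w")
    start = time.time()
    for i in range(n):
        sink.write(json.dumps(make_document(i)) + "\n")
    elapsed = time.time() - start
    print("%d docs in %.1fs (%.0f docs/s)" % (n, elapsed, n / elapsed))

# benchmark()

If that rate isn't comfortably above what you need, the bottleneck is on
your side of the wire, not in ES.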

Drew
