Optimizing ElasticSearch for Heavy Insertions

Hey,

Currently, due to the large number and frequency of bulk inserts, our ES
cluster can go from yellow to red rather quickly. Once red, it does recover,
but that takes 15-20 minutes. In fact, the last time it went red, it had
lost half the documents by the time it recovered. I could use some help
optimizing my ES cluster based on the settings
here: http://www.elasticsearch.org/guide/reference/index-modules/merge/

The ES cluster has ~14 million records in one index (the only one), nothing
is analyzed, about 9-10 GB in size, with 2 nodes and 5 shards on each
(unfortunately no SSDs yet). During peaks, we bulk insert up to 100,000
records, which can be anywhere from 50-100 MB in a few seconds. This is
usually spread across 10 workers, so 10 separate requests of 10k records
each. On top of that, there is a similar number of bulk deletes.

As you can see, our data is highly volatile, so we need to add new data and
purge old data often. When the spikes aren't as high, everything works
great. How do you recommend we tweak these settings?

Here is what the status API
returns: https://gist.github.com/jeyb/fe613843e9f768be6738 (this currently
doesn't reflect the full data set of 14m records).

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Are you sure you have sized your ES hardware to your requirements?

Have you measured the network traffic? Are you really sending "50-100 MB in
a few seconds" over the wire?

Note: if I had to process 100 MB in a second, I would ramp up 10 servers or
so, not 2.

And yes, SSD can definitely help.

Take a look at the index refresh interval. While bulk inserting, you should
disable refresh.
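A minimal sketch of that pattern, assuming a `put_settings(index, body)` callable that wraps whatever your client uses for PUT /&lt;index&gt;/_settings (the helper name is made up for the example):

```python
from contextlib import contextmanager

@contextmanager
def refresh_disabled(put_settings, index, restore_interval="1s"):
    """Disable the refresh interval around a bulk load, then restore it.

    `put_settings(index, body)` stands in for the client call that PUTs
    the body to /<index>/_settings; "-1" turns refreshing off entirely.
    """
    put_settings(index, {"index": {"refresh_interval": "-1"}})
    try:
        yield
    finally:
        # Always restore, even if the bulk load blows up halfway through.
        put_settings(index, {"index": {"refresh_interval": restore_interval}})
```

Wrapping the bulk run in `with refresh_disabled(put_settings, "myindex"): ...` guarantees the interval comes back even on failure.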

The default merge settings might not be optimal for bulk indexing; it also
depends on how many resources you can spend on the heap for segment
merging.
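For reference, the knobs in question live under the merge module linked above; a sketch of what a bulk-friendly override might look like (the values here are illustrative only, not recommendations):

```python
# Tiered merge-policy and scheduler settings from the merge module.
# Values are made up for the example; tune against your own hardware.
bulk_merge_settings = {
    "index.merge.policy.segments_per_tier": 20,    # tolerate more segments
    "index.merge.policy.max_merge_at_once": 20,    # merge more per pass
    "index.merge.scheduler.max_thread_count": 1,   # 1 for spinning disks
}
```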

If you mix bulk deletes with bulk inserts, your performance will degrade no
matter what you do. You have two options: avoid document deletes with a
sliding-window index organisation, so you can delete whole indexes (which
is a breeze); or combine deletes as much as possible and issue them rarely,
at times when you are not inserting into the same index.
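The sliding-window idea can be sketched like this: write to a time-stamped index and drop whole indexes as they age out (the `events-` prefix and daily granularity are hypothetical, made up for the example):

```python
from datetime import date, timedelta

def window_indexes(today, days=7, prefix="events-"):
    """Names of the daily indexes to keep for a sliding window.

    Writes go to the newest index; anything not in this list is removed
    with a single delete-index call instead of per-document deletes.
    """
    return [prefix + (today - timedelta(days=i)).strftime("%Y.%m.%d")
            for i in range(days)]

window_indexes(date(2013, 8, 1), days=3)
# -> ['events-2013.08.01', 'events-2013.07.31', 'events-2013.07.30']
```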

Jörg

I'd recommend using smaller bulk sizes and submitting bulk index requests
concurrently with a few threads. Sending 100,000 documents in one request
is quite hard on ES; I find it is much better at indexing in smaller
chunks. That basically allows it to refresh more frequently and process
smaller merges. Even better, ES can handle such requests concurrently. If
you combine that with a refresh interval a bit longer than the default of 1
second, you should be able to do better. I have a small bulk indexer class
that basically iterates over a collection and sends off bulk index requests
with a configurable number of threads and page size. I'm planning to put it
on GitHub at some point, when I get around to untangling it from our code
base.

In any case, measure the response time of your bulk requests and try to
keep them around a second each. That way, you can throttle the throughput
via the number of threads and the bulk size. With this setup, you could be
indexing thousands of documents per second even on a modest setup
(depending mainly on the IO and CPU needed per document). Add nodes and
threads to scale. You might want to watch iostat and top while this is
happening to see if you are hitting IO or CPU limits.
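A rough sketch of that chunk-and-throttle approach, with a pure batching helper and a size adjuster that targets roughly one second per request (the function names and bounds are made up for the example):

```python
from itertools import islice

def chunks(docs, size):
    """Split an iterable of docs into batches of at most `size`."""
    it = iter(docs)
    while batch := list(islice(it, size)):
        yield batch

def next_bulk_size(current, elapsed_s, target_s=1.0, lo=500, hi=20000):
    """Nudge the batch size so each bulk request takes ~`target_s` seconds.

    Scale proportionally to the last observed response time, clamped to
    a sane range so one outlier doesn't swing the size wildly.
    """
    scaled = int(current * target_s / max(elapsed_s, 0.01))
    return max(lo, min(hi, scaled))
```

Each worker thread would pull a batch from `chunks(...)`, time the bulk request, and feed the elapsed time back into `next_bulk_size` for its next batch.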

Jilles

Jilles, there is already a BulkProcessor class in ES core to assist with
bulk indexing when using the Java API.

I have also implemented some helper classes with concurrent lists for bulk
item queueing, plus shorter bulk responses where only the bulk errors are
included in the response.

One more note: the concurrent processing is already there in ES if you
ingest a single large bulk, but the larger the bulk request is, the longer
it takes to decode and distribute on a single node. On the client side,
concurrency can be achieved by using multithreaded doc preparation and a
TransportClient with several nodes connected, so the client uses a
round-robin strategy to distribute the bulk actions.
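That round-robin dispatch is easy to sketch outside the Java API too; here is a hypothetical Python version where `send(node, bulk)` stands in for the actual transport call:

```python
from itertools import cycle

class RoundRobinBulkSender:
    """Spread bulk requests across connected nodes in turn -- roughly
    what a TransportClient with several nodes gives you out of the box."""

    def __init__(self, nodes, send):
        self._nodes = cycle(nodes)  # endless a, b, c, a, b, c, ...
        self._send = send           # placeholder for the transport call

    def submit(self, bulk):
        return self._send(next(self._nodes), bulk)
```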

Jörg

+1 to Jilles' points. And lower your refresh rate as much as you can (we
set it to -1 in some cases).

One approach that worked quite well for us, though I realise it doesn't
apply to all cases, works great in a scriptable virtual environment like
AWS. When a full reindex is needed, we spin up a separate cluster with
larger instances (m2.4xls instead of, say, the m1.xls used in production; 4
instances; 0 replicas, but obviously the correct mappings and shards) and
fast disks (10+ EBS volumes in RAID 0), disable refresh, and bulk insert
(10k docs, 20+ threads). We indexed ~75 million docs in 90 minutes or so.
Then re-enable refresh, add new standard instances to the cluster, set
routing options away from the huge instances, increase the number of
replicas, run smoke tests, cut over the app/consumers to the new cluster,
reindex any docs that may be needed (depends on your app; we usually
duplicate messages and play them back), and kill your old cluster.
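The settings flips in that workflow boil down to two phases; a sketch of the payloads involved (the values come from the example above, not universal recommendations):

```python
# Phase 1: load as fast as possible -- no replicas, no refreshes.
bulk_phase = {
    "index.number_of_replicas": 0,
    "index.refresh_interval": "-1",
}
# Phase 2: serve traffic -- replicate and refresh again.
serve_phase = {
    "index.number_of_replicas": 2,
    "index.refresh_interval": "1s",
}
```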

We use some of these options when doing bulk updates on a cluster without
needing to migrate to a new one.
B.