+1 to Jilles' points. And reduce your refresh settings as much as you can (we
set the refresh interval to -1, i.e. disabled, in some cases).
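Something like this with the Python client (index name is a placeholder, and
this is a sketch rather than our exact code):

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Disable refresh entirely while bulk loading (-1 turns it off)
es.indices.put_settings(index="myindex",
                        body={"index": {"refresh_interval": "-1"}})

# ... bulk load ...

# Restore a sane interval afterwards (the default is 1s)
es.indices.put_settings(index="myindex",
                        body={"index": {"refresh_interval": "1s"}})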
One approach that worked quite well for us, though I realise this doesn't
apply to all cases... it works great on a scriptable virtual env like AWS.
When a full reindex is needed, we spin up a separate cluster: larger
instances (m2.4xlarge instead of, say, the m1.xlarge used in production; 4
instances; 0 replicas but obviously the correct mappings and shards), fast
disks (10+ EBS volumes in RAID 0), disable refresh, bulk insert (10k docs per
request, 20+ threads)... Indexed ~75 million docs in 90 minutes or so.
Re-enable refresh.
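Very roughly, the per-index part of that looks something like the sketch
below (Python client; cluster address, index name, shard count, mappings and
doc source are all placeholders):

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://bulk-node-1:9200"])   # the temporary bulk cluster

my_mappings = {"doc": {"properties": {}}}   # placeholder: your real mappings
my_docs = ({"field": "value"} for _ in range(1000))   # placeholder doc source

# Create the index up front with the right shard count and mappings,
# no replicas, and refresh disabled for the duration of the load.
es.indices.create(index="myindex", body={
    "settings": {
        "number_of_shards": 8,        # placeholder
        "number_of_replicas": 0,
        "refresh_interval": "-1",
    },
    "mappings": my_mappings,
})

def actions(docs):
    for doc in docs:
        # "_type" matters on the ES versions of this era
        yield {"_index": "myindex", "_type": "doc", "_source": doc}

# ~10k docs per request, many concurrent writer threads
for ok, item in helpers.parallel_bulk(es, actions(my_docs),
                                      thread_count=20, chunk_size=10000):
    if not ok:
        print(item)

# Re-enable refresh once the load is done
es.indices.put_settings(index="myindex",
                        body={"index": {"refresh_interval": "1s"}})
es.indices.refresh(index="myindex")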
Then add new standard instances to this cluster, set allocation/routing so
shards move off the huge instances, increase the number of replicas, run
smoke tests, cut the app/consumers over to the new cluster, reindex any docs
which may be needed (depends on your app... we usually duplicate messages and
play them back), and kill your old cluster.
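The "route shards away and add replicas" step is just index settings, along
these lines (placeholder IPs for the m2.4xlarge boxes):

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://bulk-node-1:9200"])

# Push shards off the big bulk instances and onto the new standard ones,
# then add replicas.
es.indices.put_settings(index="myindex", body={
    "index.routing.allocation.exclude._ip": "10.0.0.1,10.0.0.2,10.0.0.3,10.0.0.4",
    "index.number_of_replicas": 1,
})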
We use some of these options when doing bulk updates on a cluster without
needing to migrate to a new one.
B.
On 02/08/2013 7:52 PM, "Jilles van Gurp" jillesvangurp@gmail.com wrote:
I'd recommend using smaller bulk sizes and submitting bulk index requests
concurrently with a few threads. Sending 100,000 documents in one request is
quite hard on ES. I find it is much better at indexing in smaller chunks:
that basically allows it to refresh more frequently and process smaller
merges. Even better, ES can handle such requests concurrently. If you combine
that with a refresh interval of maybe a bit more than the default of 1
second, you should be able to do better. I have a small bulk indexer class
that basically iterates over a collection and sends off bulk index requests
with a configurable number of threads and page size. I'm planning to put it
on GitHub at some point when I get around to untangling it from our code
base.
In any case, measure the response time of your bulk requests and try to keep
them around a second each. This way, you can throttle the throughput with the
number of threads and the bulk size. With this setup, you could be indexing
thousands of documents per second even on a modest setup (depending mainly on
the IO and CPU needed per document). Add nodes and threads to scale. You
might want to watch iostat and top while this is happening to see whether you
are hitting IO or CPU limits.
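The class isn't untangled yet, but a rough sketch of the same idea (iterate a
collection, fan bulk requests out over a thread pool, and time each request
so chunk size and thread count can be tuned) could look like this in Python;
all names and numbers are placeholders, not the actual class:

import time
from concurrent.futures import ThreadPoolExecutor
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://localhost:9200"])

def index_chunk(chunk):
    """Send one bulk request and report how long it took."""
    start = time.time()
    helpers.bulk(es, ({"_index": "myindex", "_type": "doc", "_source": d}
                      for d in chunk))
    # aim to keep each request around a second
    print("bulk of %d docs took %.2fs" % (len(chunk), time.time() - start))

def chunks(docs, size):
    chunk = []
    for doc in docs:
        chunk.append(doc)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk

docs = [{"field": "value"}] * 5000   # placeholder for your real collection
with ThreadPoolExecutor(max_workers=4) as pool:        # tune thread count...
    list(pool.map(index_chunk, chunks(docs, 1000)))    # ...and chunk size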
Jilles
On Thursday, August 1, 2013 1:51:41 PM UTC+2, Jörg Prante wrote:
Are you sure you have sized your ES hardware to your requirements?
Have you measured the network traffic? Are you really sending "50-100mb
in a few seconds" over the wire?
Note, if I had to process 100 MB a second, I would ramp up 10 servers or
so, not 2.
And yes, SSD can definitely help.
Take a look at the index refresh interval. While bulk inserting, you should
disable refresh.
The default merge settings might not be optimal for bulk; it also depends on
how many resources you can spend on the heap for segment merging.
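For example, store/merge throttling can be relaxed while bulk loading and
restored afterwards; the exact setting names are version-dependent, so treat
this Python-client snippet as an illustration only:

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Relax store throttling (which caps merge IO) for the duration of the load
es.cluster.put_settings(body={
    "transient": {"indices.store.throttle.type": "none"}
})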
If you mix bulk deletes with bulk inserts, your performance will degrade, no
matter what you do. You have two options: try to avoid doc deletes via a
sliding-window index organisation, so you can delete whole indexes (which is
a breeze), or try to batch deletes as much as possible and issue them rarely,
at times when you are not inserting into the same index.
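A common way to get that sliding window is time-based indexes behind an
alias, so expiring old data becomes a whole-index delete; a sketch with
placeholder names (Python client):

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Write to a per-period index, search through an alias spanning the window.
es.indices.create(index="events-2013.08", ignore=400)   # ignore "already exists"
es.indices.put_alias(index="events-2013.08", name="events")

# Expiring old data is then a cheap whole-index delete, not per-doc deletes.
es.indices.delete(index="events-2013.05", ignore=404)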
Jörg