Issue Indexing 50mil Docs via Bulk API

Oh man, after a few days of tinkering (I say a few days, I've been working
on this for a while...) I'm finally indexing in bulk at a reasonable speed
of ~2,500 docs per second.

I'm going to tweak some settings and see how fast I can get it and then
I'll post my final settings. I think the key was scaling out rather than
scaling up.

Cheers,
James

On Fri, Mar 1, 2013 at 11:18 AM, Jörg Prante joergprante@gmail.com wrote:

It's a matter of distribution of the data. Look at where your primary
shards are, since all indexing goes through the primary shards, and at how
many resources (CPU cores, memory) are available there. Check how your
routing distributes over the shards; it depends on the routing parameter
and on whether the data volume of a user varies much with respect to the
average user's data volume. If the overall data distribution is even, there
is no "bottleneck".
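
For example, the routing table in the cluster state API shows which node
each primary shard has landed on. A minimal C# sketch of reading it (the
host is a placeholder, it assumes the Json.NET package, and it's just an
illustration rather than code from this thread):

    using System;
    using System.Collections.Generic;
    using System.Net.Http;
    using Newtonsoft.Json.Linq;

    class ShardCheck
    {
        static void Main()
        {
            using (var http = new HttpClient())
            {
                // the cluster state holds the routing table: index -> shard -> copies
                var json = http.GetStringAsync("http://localhost:9200/_cluster/state").Result;
                var state = JObject.Parse(json);
                var indices = (JObject)state["routing_table"]["indices"];

                foreach (KeyValuePair<string, JToken> index in indices)
                {
                    var shards = (JObject)index.Value["shards"];
                    foreach (KeyValuePair<string, JToken> shard in shards)
                    {
                        foreach (var copy in shard.Value)      // each copy of the shard
                        {
                            if ((bool)copy["primary"])         // only report primaries
                                Console.WriteLine("{0} shard {1}: primary on node {2}",
                                    index.Key, shard.Key, copy["node"]);
                        }
                    }
                }
            }
        }
    }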

Increasing shards has an upper limit. You can increase the number of
shards as long as your machines can handle the distributed indexing load.
If that limit is exceeded, just add machines; it's as easy as that.

Jörg

On 01.03.13 11:42, james.lewis@7digital.com wrote:

This is interesting - we use routing to make sure that all of a user's
documents will be indexed in the same shard. So we're not just hitting the
bulk API with documents, we're also supplying a route for each document
based on the user id. Does anyone know how this might impact the
performance of a batch load?
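
For context, supplying a route per document just means putting the routing
value on each action's metadata line in the bulk body. A rough C# sketch
of what that looks like on the wire (index, type, ids and field names are
made up; the metadata key was "_routing" on the 0.90/1.x API and is
"routing" on recent versions):

    using System;
    using System.Net.Http;
    using System.Text;

    class RoutedBulk
    {
        static void Main()
        {
            // two documents for the same user, both routed by the user id so
            // they land on the same shard
            var body = new StringBuilder();
            body.AppendLine("{\"index\":{\"_index\":\"purchases\",\"_type\":\"purchase\",\"_id\":\"1\",\"_routing\":\"user-42\"}}");
            body.AppendLine("{\"user_id\":\"user-42\",\"track\":\"track-a\"}");
            body.AppendLine("{\"index\":{\"_index\":\"purchases\",\"_type\":\"purchase\",\"_id\":\"2\",\"_routing\":\"user-42\"}}");
            body.AppendLine("{\"user_id\":\"user-42\",\"track\":\"track-b\"}");

            using (var http = new HttpClient())
            {
                // recent Elasticsearch versions expect Content-Type: application/x-ndjson
                var content = new StringContent(body.ToString(), Encoding.UTF8, "application/json");
                var response = http.PostAsync("http://localhost:9200/_bulk", content).Result;
                Console.WriteLine(response.Content.ReadAsStringAsync().Result);
            }
        }
    }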

YES! VICTORY!!

After reading through all the comments and taking everyone's advice, I can
now run my backfill in a fairly decent time. I haven't measured disk I/O
at all yet or tried to figure out where the bottleneck could be, so there's
still a lot of room for improvement, which is great news.

My previous configuration was 2 Debian servers with 32GB RAM each and 2
cores. I scrapped that idea and built a new cluster with 4 data nodes and
2 router nodes. I pointed an HAProxy load balancer at the 2 router nodes
and then pointed my backfill application at the LB. The 4 data nodes have
8GB RAM each and the 2 router nodes have 4GB.

The next thing I did was alter my backfill app. I set the max concurrent
connections on my .NET (NEST) Elasticsearch client to 25. I then altered
the backfill to read chunks of 1,000,000 docs into memory and process them
with a parallel foreach with a max degree of parallelism of 25. I set the
batch size of each bulk insert to 500 docs.
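
The shape of it is roughly the sketch below. This isn't the real backfill
code - it posts raw bulk bodies over HttpClient instead of going through
NEST, and the index/type names and chunk loader are placeholders - but the
knobs (25 connections, parallelism of 25, batches of 500) are the ones
described above:

    using System.Collections.Generic;
    using System.Linq;
    using System.Net;
    using System.Net.Http;
    using System.Text;
    using System.Threading.Tasks;

    class Backfill
    {
        // one shared client; HttpClient is safe for concurrent requests
        static readonly HttpClient Http = new HttpClient();

        static void Main()
        {
            // allow 25 concurrent connections out of the process
            ServicePointManager.DefaultConnectionLimit = 25;

            // pretend this is the 1,000,000-doc chunk read into memory
            List<string> docs = LoadChunkOfDocs();

            // split the chunk into bulk batches of 500 documents
            var batches = docs
                .Select((doc, i) => new { doc, i })
                .GroupBy(x => x.i / 500, x => x.doc)
                .Select(g => g.ToList())
                .ToList();

            // index the batches with at most 25 batches in flight at once
            var options = new ParallelOptions { MaxDegreeOfParallelism = 25 };
            Parallel.ForEach(batches, options, batch =>
            {
                var body = new StringBuilder();
                foreach (var doc in batch)
                {
                    body.AppendLine("{\"index\":{\"_index\":\"purchases\",\"_type\":\"purchase\"}}");
                    body.AppendLine(doc);   // doc is already a JSON string here
                }

                var content = new StringContent(body.ToString(), Encoding.UTF8, "application/json");
                var response = Http.PostAsync("http://loadbalancer:9200/_bulk", content).Result;
                response.EnsureSuccessStatusCode();
            });
        }

        static List<string> LoadChunkOfDocs()
        {
            // placeholder for reading the next 1,000,000 documents from the source
            return new List<string>();
        }
    }

ServicePointManager.DefaultConnectionLimit is the .NET-level stand-in here
for the NEST connection-limit setting mentioned above.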

That sped my backfill up, but it wasn't reliable. It still slowed down
somewhat after 200K inserts (I don't know why yet). So the next thing I
did was to build a VM just for the backfill (I had been running it on a
fairly questionable dev box). I built a Windows Server 2008 box with 8
virtual cores and 8GB RAM, deployed the backfill app to it and hit the go
button. BANG! It started raining indexes. Totally awesome.

Unfortunately, in my haste to get this working I forgot that I was an
engineer and didn't measure anything, so I can't tell you exactly what
improved it. I'll put my engineer hat on again later in the week and try
to figure it out. But until then, thanks to everyone who helped out.

Regards,
James

Hi james.lewis/Randall,

I am facing an indexing performance problem.

I have deployed ES on an AWS EC2 m3.xlarge server, configured the yml file
with the cluster information, and created the index with 10 shards and 1
replica. I have also made some other configuration changes, but I am still
only getting 1,200-1,300 docs/sec. The sample was 390,312 docs (nearly 2GB).
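
For reference, the index was created with settings along these lines
(index name and host here are placeholders, and the C# wrapper is only an
illustration of the REST call, not my exact setup):

    using System;
    using System.Net.Http;
    using System.Text;

    class CreateIndex
    {
        static void Main()
        {
            // 10 primary shards, 1 replica each
            var settings = "{\"settings\":{\"number_of_shards\":10,\"number_of_replicas\":1}}";

            using (var http = new HttpClient())
            {
                var content = new StringContent(settings, Encoding.UTF8, "application/json");
                var response = http.PutAsync("http://localhost:9200/myindex", content).Result;
                Console.WriteLine(response.Content.ReadAsStringAsync().Result);
            }
        }
    }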

If you are interested in replying, I will send the complete configuration
set.

My target is to get 2,500+ docs/sec.

Waiting for your reply,
Regards,
Bala