Issue Indexing 50mil Docs via Bulk API

Hi guys, I've got an issue I need a hand diagnosing, to do with bulk
indexing. I have a feeling I could just be hammering the cluster too hard,
but just in case it's something I've set up wrong, I've included as much
information as possible.

Elasticsearch Configuration

We have two nodes running elasticsearch v0.20.4 with 1 index split across
100 shards and 1 replica of the index (we've over-allocated and we use
routing / aliases when indexing / searching). This is the elasticsearch.yml
config on both of our nodes: https://gist.github.com/getsometoast/5047292
and this is the logging.yml for both nodes:
https://gist.github.com/getsometoast/5047308. Both nodes are running Oracle
Java version 1.7.0_13 and both have 32GB RAM with 24GB allocated to the
JVM. So you can see the Java environment variables for the process, here's
a gist of them: https://gist.github.com/getsometoast/5047322

Background
Last night I ran a backfill into our production cluster: I tried to index
50 million documents (avg doc size ~2kb) via the bulk API in chunks of
100,000. I pointed my elasticsearch client at one of the nodes in the
cluster and left the backfill process running overnight. When I came in to
inspect it this morning it had run for approx. 11hrs, indexed ~3/4 of the
50 million docs, and then hung. I had a look at my backfill process logs
and it had hung sending a request to the bulk API.

I then inspected the logs on my elasticsearch nodes (I'm having a lot of
issues with elasticsearch logging, but that's another topic). The log for
the node that I was sending my bulk index requests to has nothing in it for
the majority of the time the backfill was running (everything is set to
debug in the yml, so this confuses me). However, it does have some entries
at about 5pm yesterday evening, shortly after I kicked off the process, and
then some again this morning, long after the backfill process started
hanging - here's a gist of the output:
https://gist.github.com/getsometoast/5047202. The other node has nothing in
its logs.

The other node, the one I wasn't sending the bulk index request to, is
currently using 77% of the memory allocated to it, way higher than the
other node. Here's the output from paramedic:

https://lh6.googleusercontent.com/-FGqUcxfYWgs/US3z7ZzTTCI/AAAAAAAAAAM/Lb5o74WZ698/s1600/elasticsearch-paramedic.jpeg

Just for completeness, here's how I run a backfill (see the settings sketch
after this list):

  • set the index refresh to -1 and merge policy factor to 30
  • read all the data I need into memory
  • denormalize the data in 100,000 object batches
  • create a bulk request for the batch
  • send the bulk request to one of my elasticsearch nodes (hard coded
    address - last night I used the current master node in the cluster)
  • when finished processing all data, set the index refresh to 1s and
    merge factor to 10
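
For reference, the first and last steps amount to two index settings
updates. Here's a rough sketch using the 0.20-era Java API (the node
address and index name are placeholders; the same settings changes can
equally be made over HTTP):

```java
import org.elasticsearch.client.Client;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

public class BackfillSettings {
    public static void main(String[] args) {
        // Placeholder node address and index name - adjust for your cluster.
        Client client = new TransportClient()
                .addTransportAddress(new InetSocketTransportAddress("es-node-1", 9300));

        // First step: turn off periodic refresh and raise the merge factor for the load.
        client.admin().indices().prepareUpdateSettings("myindex")
                .setSettings(ImmutableSettings.settingsBuilder()
                        .put("index.refresh_interval", "-1")
                        .put("index.merge.policy.merge_factor", 30)
                        .build())
                .execute().actionGet();

        // ... read, denormalize and bulk index the data here ...

        // Last step: restore the normal settings once the backfill has finished.
        client.admin().indices().prepareUpdateSettings("myindex")
                .setSettings(ImmutableSettings.settingsBuilder()
                        .put("index.refresh_interval", "1s")
                        .put("index.merge.policy.merge_factor", 10)
                        .build())
                .execute().actionGet();

        client.close();
    }
}
```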

My Question
Sorry if all that was a bit long-winded - I just wanted to get everything I
know about the state of the system down to see if anyone could spot
anything weird I might be doing. My question is: why would a request to
the bulk API hang and never time out, and when this happens, why would I
not see any clear sign of an error on the cluster?

Any help, questions etc greatly appreciated as this is starting to block
our progress.

Regards,
James


Hiya

> We have two nodes, running elasticsearch v0.20.4 with 1 index split
> across 100 shards and 1 replica of the index (we've over allocated and
> we use routing / aliases when indexing / searching). This is the
> elasticsearch.yml config on both of our nodes and this is the
> logging.yml for both the nodes. Both nodes are running oracle Java
> version: 1.7.0_13 and both have 32GB RAM with 24GB allocated to the
> JVM.

Side note: you want to leave about 50% of your RAM for the kernel
filesystem cache, so I'd set your heap to 16GB, not 24GB.

> Background
> Last night I ran a backfill into our production cluster, I tried to
> index 50 million documents (avg doc size ~2kb) via the bulk API in
> chunks of 100,000. I pointed my elasticsearch client at one of the
> nodes in the cluster and left the backfill process running over night.
> When I came in to inspect it this morning it had run for approx. 11hrs
> and indexed ~3/4 of the 50 million docs and then hung. I had a look
> at my backfill process logs and it had hung sending a request to the
> bulk API.

Bulk loads of 100,000 docs seems very high to me - you may run into
memory issues. In my tests (based on my own documents) the sweet spot is
1000-5000 docs at a time, although you can run multiple "feeders".
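
To make the "smaller bulks from multiple feeders" idea concrete, here's a
rough sketch against the Java API (not production code - the index/type
names, chunk size and thread count are just placeholders to illustrate the
shape):

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.client.Client;

public class Feeders {

    // Index one big batch as many small bulk requests sent from a few worker threads.
    static void indexInChunks(final Client client, List<String> jsonDocs) throws InterruptedException {
        final int chunkSize = 2000;                                 // somewhere in the 1000-5000 range
        ExecutorService feeders = Executors.newFixedThreadPool(4);  // number of concurrent "feeders"

        for (int start = 0; start < jsonDocs.size(); start += chunkSize) {
            final List<String> chunk =
                    jsonDocs.subList(start, Math.min(start + chunkSize, jsonDocs.size()));
            feeders.submit(new Runnable() {
                public void run() {
                    BulkRequestBuilder bulk = client.prepareBulk();
                    for (String json : chunk) {
                        bulk.add(client.prepareIndex("myindex", "mytype").setSource(json));
                    }
                    BulkResponse response = bulk.execute().actionGet();
                    if (response.hasFailures()) {
                        System.err.println(response.buildFailureMessage());
                    }
                }
            });
        }
        feeders.shutdown();
        feeders.awaitTermination(1, TimeUnit.HOURS);
    }
}
```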

clint


Hey Clinton,

Cheers for the memory advice - I'll get that sorted.

By multiple feeders you mean I could create a batch of 100,000 docs and
then bulk index the batch in chunks of about 1000 in parallel? That sounds
like a great idea - I settled on batches of 100,000 because I calculated
that any less and the bulk indexing operation would take more than 24 hours
in total which isn't acceptable. But smaller batches in parallel sounds
like a plan.

Any idea how many concurrent requests I could fire at ES before I see
problems? I'll be testing this myself today but a starter for 10 would be
appreciated.

Cheers,
James


Heya

> Cheers for the memory advice - I'll get that sorted.

Also make sure you're not swapping.

> By multiple feeders you mean I could create a batch of 100,000 docs
> and then bulk index the batch in chunks of about 1000 in parallel?
> That sounds like a great idea - I settled on batches of 100,000
> because I calculated that any less and the bulk indexing operation
> would take more than 24 hours in total which isn't acceptable. But
> smaller batches in parallel sounds like a plan.

> Any idea how many concurrent requests I could fire at ES before I see
> problems? I'll be testing this myself today but a starter for 10
> would be appreciated.

Heh - "it depends". Difficult to give you a clear answer.

Also consider round robining requests - don't send them all to one node.
One thing you want to avoid is, you hit a big merge, but you keep
throwing bulk requests at a system which has slowed down because of
heavy IO.

So you probably want to limit the number of threads available for bulk
indexing.
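
As a rough sketch of the round-robin part (the cluster name and addresses
are placeholders): the Java TransportClient spreads requests across every
address you register, and the bulk thread pool on the nodes themselves can
also be capped in elasticsearch.yml.

```java
import org.elasticsearch.client.Client;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

public class RoundRobinClient {

    public static Client build() {
        // Register both nodes; the TransportClient spreads requests across every
        // address it knows about instead of sending them all to a single node.
        // "sniff" lets it discover the rest of the cluster by itself.
        return new TransportClient(ImmutableSettings.settingsBuilder()
                        .put("cluster.name", "my-cluster")          // placeholder
                        .put("client.transport.sniff", true)
                        .build())
                .addTransportAddress(new InetSocketTransportAddress("es-node-1", 9300))
                .addTransportAddress(new InetSocketTransportAddress("es-node-2", 9300));
    }
}

// On the nodes themselves, the bulk thread pool can be capped with something
// like this in elasticsearch.yml (a fixed pool with a bounded queue):
//
//   threadpool.bulk.type: fixed
//   threadpool.bulk.size: 4
//   threadpool.bulk.queue_size: 50
```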

clint


All great info - thanks Clinton. Also just reading through dmyung's bulk
indexing questions
(https://groups.google.com/forum/?fromgroups=#!topic/elasticsearch/JWGUztmgc_s)
- I think I need to go and have a read about how Lucene works as I don't
fully understand the merge stuff.

Thanks again,
James


Just a note from my experience:

It's not helpful to start a bulk index against a single node with 100
threads in parallel and 1000 docs in each request.

Start small and scale out until you find a "sweet spot".

Estimate the bulk capacity first. Put up a bulk request with 10, 100,
1000 ... docs and do a complete single thread bulk run. Measure the
average response time of a bulk request (not the "took" time in the bulk
response). Compute the size of a bulk request in bytes, and you have a
rough number of what a single thread can index against a single node.
The response time is proportional to your disk subsystem performance.

Now, check how many CPU cores your node is running - that is my rule of
thumb. Check the network bandwidth also, if you bulk index from a remote
machine. You can now go up to approx. 2x or 3x the total number of CPU
cores of the nodes you connect to, or the maximum capacity of the network
(compression is on by default), whichever comes first.

If you go higher, your bulk index will be slow (because of the high
load), or may even stall sooner or later because of the congestion.
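
A minimal sketch of that estimation step, assuming the Java API and
placeholder index/type names - time the whole round trip of one bulk
request per candidate size, on a single thread:

```java
import java.util.List;

import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.client.Client;

public class BulkCapacityEstimate {

    // Time one whole bulk round trip per candidate size, single-threaded.
    // Assumes jsonDocs holds at least 5000 documents as JSON strings.
    static void estimate(Client client, List<String> jsonDocs) {
        for (int size : new int[]{10, 100, 1000, 5000}) {
            BulkRequestBuilder bulk = client.prepareBulk();
            long bytes = 0;
            for (String json : jsonDocs.subList(0, size)) {
                bulk.add(client.prepareIndex("myindex", "mytype").setSource(json));
                bytes += json.length();
            }
            long start = System.currentTimeMillis();
            bulk.execute().actionGet();                 // wall-clock time, not the "took" field
            long elapsed = System.currentTimeMillis() - start;
            System.out.printf("%d docs (~%d bytes): %d ms -> ~%.0f docs/sec on one thread%n",
                    size, bytes, elapsed, size * 1000.0 / elapsed);
        }
    }
}
```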

Best regards,

Jörg


Yeah I think one of my problems here is that I'm feeling a bit under
pressure and letting my engineering best practices slip. You're absolutely
right, I should be measuring everything, taking small steps and finding my
own sweet spot rather than looking for a silver bullet.

Thanks for the advice guys, I'll write a reply post with my findings when
I'm done for anyone else who comes across this.

Regards,
James


We found a bulk size of 2500 to be around optimal for us. It also depends
on how big your docs are (how big are they?). As other posters have
mentioned, setting the bulk size too high merely consumes your resources
and does not actually get things done faster. We use four threads, but
we're actually trying to keep docs going in at a steady rate since the
cluster is live on another index while building the new one. We index 30m
docs in less than 6 hours (the index is 70GB). But our cluster is larger:
8 machines, each with one shard.

Also, we found that 32GB of memory per instance was not enough - we ended
up with 63GB each. It's also true that the least possible JVM heap should
be given to Elasticsearch - the filesystem cache is a much better use for
that memory. Our JVM heap is set to 20GB out of that 63GB.

Randy


For bulk-loading 90M documents (all "index", no "delete"), here's what I
did:

  1. Wrote a Java-based bulk loader to get around curl limitations. No big
    deal, but works very nicely.

  2. Bulk size is 4096. The index is configured with a 1s refresh time, but
    this is bumped temporarily to 120s during the load.

  3. Defined 16 shards, 1 replica (but that replica isn't active).

  4. One client thread serially loading the input JSON (action_and_meta_data
    line followed by the source line) in chunks of 4096 "index" actions.

Total time is between 2:46 and 3:20 (that's hours:minutes) on a laptop with
quad i7 CPUs but one slow disk that contains both the input data and the ES
database itself.

"Between", as in, when I made many runs of delete and then reload for
testing and tuning, those are the recorded low and high times. I'm guessing
the variance has more to do with periodic Time Machine activity during the
run.


Hi James, for comparison: we can index 50 million items on a very simple
3-node ES cluster, each node with only 4GB heap allocated (the boxes have
lots of physical RAM though; why the heap is so low I won't bother
explaining because it's a long story) and a default 5-shard index. With 18
threads on another server pulling data out of our database and pushing
10,000 items per _bulk call per thread, we can index about 1000-1500
items/second per thread, roughly 18,000 items per second combined. We
disable replicas and snapshots, and up the refresh interval as you've done.
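
For what it's worth, the "disable replicas, up the refresh interval" part
boils down to two settings updates. A sketch assuming the Java API and a
placeholder index name (restore both once the load has finished):

```java
import org.elasticsearch.client.Client;
import org.elasticsearch.common.settings.ImmutableSettings;

public class LoadSettings {

    // Before the load: no replicas, no periodic refresh. "myindex" is a placeholder.
    static void prepareForLoad(Client client) {
        client.admin().indices().prepareUpdateSettings("myindex")
                .setSettings(ImmutableSettings.settingsBuilder()
                        .put("index.number_of_replicas", 0)
                        .put("index.refresh_interval", "-1")
                        .build())
                .execute().actionGet();
    }

    // After the load: bring the replica back and re-enable refresh.
    static void restoreAfterLoad(Client client) {
        client.admin().indices().prepareUpdateSettings("myindex")
                .setSettings(ImmutableSettings.settingsBuilder()
                        .put("index.number_of_replicas", 1)
                        .put("index.refresh_interval", "1s")
                        .build())
                .execute().actionGet();
    }
}
```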

Our biggest performance challenge is simply getting it out of the DB fast
enough.

I'm with Clinton, drop your Heap usage down, leave some for disk buffering
and drop your bulk payload batch size right down to 10,000 or below.

cheers,

Paul


Cheers again guys, it's really helpful reading about others' experiences.
@InquiringMind - those times sound incredible! I'm looking at 15 hours at
the moment - I'm thinking perhaps the documents I'm trying to index are
much more complex? @PaulSmith - same to you, those times sound pretty
amazing too. What do the documents look like that you guys are indexing?
Does the complexity of the document even affect bulk indexing time? This
is another thing I'm not fully aware of - does anyone know of something
that explains how the indexing process works? I'm guessing a simple search
for how Lucene indexing works will suffice?

Thanks for all the help,
James


Our payload isn't big: each doc is about 20 fields, with a 50/50 split of
small textual strings (descriptions, labels etc.) and numeric ID values,
plus a scattering of date/times. Hundreds of bytes per record, probably.


My documents aren't overly complex: Mostly names, addresses, city and
state, plus some other information. But all of the name fields are
lowercased and stemmed using the English snowball analyzer:

  • 5 string fields are names that are stemmed with the standard tokenizer
    and the standard, lowercase, and English snowball filter.
  • 6 fields are other strings (dates and such) that are mapped using the
    standard tokenizer with only the standard and lowercase filters.
  • 2 fields are mapped as long integers.

When I first loaded these same 90M documents, my elapsed time was just over
11 hours.

I then made two changes: increase the index refresh to 120s during the
load, and configure 16 shards instead of 5. Then my first re-run of the
load (after deleting the index and re-creating it with the mappings, of
course) was 2:46 (2 hours and 46 minutes).

Again, my bulk loader is rather simplistic: create a BulkRequestBuilder,
fill it with 4096 documents and run execute().actionGet() against it, show
any errors (usually there aren't any), then create a new BulkRequestBuilder
and repeat. If there are superior ways to use this class, I'd really like
to see some examples. I based this on what I could find and haven't
explored different parameters because what I have now works well enough,
even if it's not optimum. But optimum is where I'll need more information
than the current Javadoc provides.
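
For reference, that loop looks roughly like the sketch below. It assumes
one JSON document per line in the input file (a simplification of the
action-and-source format described earlier) and placeholder index/type
names:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.client.Client;

public class SimpleBulkLoader {

    static void load(Client client, String path) throws IOException {
        BufferedReader in = new BufferedReader(new FileReader(path));
        BulkRequestBuilder bulk = client.prepareBulk();
        String line;
        while ((line = in.readLine()) != null) {
            // Placeholder index/type names; one JSON source document per input line.
            bulk.add(client.prepareIndex("people", "person").setSource(line));
            if (bulk.numberOfActions() >= 4096) {
                flush(bulk);
                bulk = client.prepareBulk();           // new builder, then repeat
            }
        }
        if (bulk.numberOfActions() > 0) {
            flush(bulk);                               // final partial batch
        }
        in.close();
    }

    static void flush(BulkRequestBuilder bulk) {
        BulkResponse response = bulk.execute().actionGet();
        if (response.hasFailures()) {                  // show any errors (usually there aren't any)
            System.err.println(response.buildFailureMessage());
        }
    }
}
```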


So from what everyone's said I should be:

  • setting the refresh interval high (or off, -1).
  • setting the merge factor to something high (30 or even higher - I read
    a Lucene article that recommends 1000 for large batches, but I haven't
    experimented with higher numbers yet).
  • figuring out what my sweet spot is in terms of batch size with a single
    feeder thread. Somewhere between 1000 and 5000.
  • once I've got a sweet spot, increasing threads to around 2 or 3 times
    the number of cores on the server (so a max of 6 in my case).
  • running the backfill.

So far I've spent a day trying to find that sweet spot. I'm taking a
subset of 100,000 docs and trying to index them in batches of varying
sizes. I have found that if I go above 1000, the total time taken starts
to trail off. With a batch of 1000 I was seeing an average indexing time
of around 2 seconds per batch. I then started running batches in parallel
(I'm using C#, so the Task Parallel Library) and I saw no improvement in
speed... this could be my shoddy coding though, I'm yet to look into that.
Anyway, currently it's going to take about 26 hours for the full ingestion,
so I need to spend more time figuring out that parallel bit. I'm so jealous
of your times!

Finally, I'm still wondering if our documents are quite complex and how
this can affect indexing speed. I've made a gist of an example doc that
we're indexing here https://gist.github.com/getsometoast/5056507 - if
anyone would care to just glance at it and let me know if they think it's
complex or not I'd greatly appreciate it.

Regards,
James

On Wed, Feb 27, 2013 at 10:10 PM, InquiringMind brian.from.fl@gmail.comwrote:

My documents aren't overly complex: Mostly names, addresses, city and
state, plus some other information. But all of the name fields are
lowercased and stemmed using the English snowball analyzer:

  • 5 string fields are names that are stemmed with the standard
    tokenizer and the standard, lowercase, and English snowball filter.
  • 6 fields are other strings (dates and such) that are mapped using
    the standard tokenizer with only the standard and lowercase filters.
  • 2 fields are mapped as long integers.

When I first loaded these same 90M documents, my elapsed time was just
over 11 hours.

I then made two changes: increase the index refresh to 120s during the
load, and configure 16 shards instead of 5. Then my first re-run of the
load (after deleting the index and re-creating it with the mappings, of
course) was 2:46 (2 hours and 46 minutes).

Again, my bulk loader is rather simplistic: create a BulkRequestBuilder,
fill it with 4096 documents, run execute().actionGet() against it, show
any errors (usually there aren't any), then create a new BulkRequestBuilder
and repeat. If there are superior ways to use this class, I'd really like
to see some examples. I based this on what I could find and haven't
explored different parameters because what I have now works well enough
even if it's not optimum. But optimum is where I'll need more information
than the current Javadoc provides.
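In code, that loop is roughly the following - a minimal sketch against the
0.20-era Java API, with placeholder index/type names and the 4096 batch size
hard-coded:

    // Minimal bulk-loading loop as described above: fill a BulkRequestBuilder,
    // execute it, check for failures, then start a new builder and repeat.
    import org.elasticsearch.action.bulk.BulkRequestBuilder;
    import org.elasticsearch.action.bulk.BulkResponse;
    import org.elasticsearch.client.Client;

    import java.util.Iterator;

    public class SimpleBulkLoader {
        private static final int BATCH_SIZE = 4096;

        public static void load(Client client, Iterator<String> jsonDocs) {
            BulkRequestBuilder bulk = client.prepareBulk();
            while (jsonDocs.hasNext()) {
                bulk.add(client.prepareIndex("people", "person").setSource(jsonDocs.next()));
                if (bulk.numberOfActions() >= BATCH_SIZE || !jsonDocs.hasNext()) {
                    BulkResponse response = bulk.execute().actionGet();
                    if (response.hasFailures()) {
                        // A real loader would inspect the per-item failures here.
                        System.err.println(response.buildFailureMessage());
                    }
                    bulk = client.prepareBulk();
                }
            }
        }
    }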



My documents are like this example.json gist on GitHub.

Right now on the production environment I am indexing at 500-1500 docs
per second on 3 nodes; a bulk ingest takes around 70-90 minutes to index
20 million docs. 4 GB heap size, 24 shards (12 with replica level 1), but
no change to the refresh settings or segment merging. It takes significantly
more time to index into an existing index than into an empty index. Most CPU
time is spent on creating the JSON sources from the originating format.

I use the Java TransportClient on a 0.19 ES cluster with improvements for
concurrent bulk indexing (a single ES client instance, indexing from 12
parallel threads reading from bzip2 files).
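Roughly the shape of that setup - a sketch only, using a shared client and a
plain thread pool; the queue of JSON documents, the thread count and the
index name are placeholders rather than the actual code:

    // Sketch of concurrent bulk indexing: one shared client, a fixed pool of
    // worker threads, each thread building and sending its own bulk requests.
    import org.elasticsearch.action.bulk.BulkRequestBuilder;
    import org.elasticsearch.client.Client;

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class ConcurrentBulkIndexer {
        public static void run(final Client client, final BlockingQueue<String> docs,
                               int threads, final int batchSize) throws InterruptedException {
            ExecutorService pool = Executors.newFixedThreadPool(threads);
            for (int i = 0; i < threads; i++) {
                pool.submit(new Runnable() {
                    public void run() {
                        BulkRequestBuilder bulk = client.prepareBulk();
                        String json;
                        // Each worker drains the shared queue and sends its own batches.
                        while ((json = docs.poll()) != null) {
                            bulk.add(client.prepareIndex("myindex", "doc").setSource(json));
                            if (bulk.numberOfActions() >= batchSize) {
                                bulk.execute().actionGet();
                                bulk = client.prepareBulk();
                            }
                        }
                        if (bulk.numberOfActions() > 0) {
                            bulk.execute().actionGet();   // flush the last partial batch
                        }
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(12, TimeUnit.HOURS);
        }
    }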

Recent experiments on a test environment showed that I can ramp bulk
indexing up much higher, to an average of 10,000 documents per second.
This will go into production as soon as I can upgrade to Elasticsearch 1.0
with Lucene 4.x :-)

Jörg

On 28.02.13 13:50, James Lewis wrote:

Finally, I'm still wondering if our documents are quite complex and
how this can affect indexing speed. I've made a gist of an example
doc that we're indexing here
https://gist.github.com/getsometoast/5056507 - if anyone would care
to just glance at it and let me know if they think it's complex or not
I'd greatly appreciate it.


In my benchmarking tests, disk I/O was almost always the bottleneck. After
you make the changes recommended in this thread (no replicas, refresh rate,
etc.), keep an eye on the IOPS of your disks. If/when you saturate them,
that's pretty much your indexing capacity without adding more nodes. I spent
a lot of time fiddling when in reality my disks were just saturated, and
there was nothing to be done about it.

Anecdotally, I've seen that as your doc size grows (or its complexity, e.g.
shingles, which can produce a lot of indexing work for a given field), the
performance gain from bulk loading becomes smaller. Presumably that's because
the processing/routing between nodes takes a proportionally smaller share of
the total request time, which is dominated by plain disk I/O. In some tests
I've hit a point where bulk indexing speed was identical to individual
inserts.

I say anecdotally because it's possible the results were an artifact of
sloppy code on my end =)

-Zach


On Thursday, February 28, 2013 7:50:46 AM UTC-5, James Lewis wrote:

So from what everyone's said I should be:

  • setting the refresh interval high (or off, -1).
  • setting the merge factor to something high (30 or even higher - read a
    Lucene article that recommends 1000 for large batches but I haven't
    experimented with higher numbers yet).
  • figuring out what my sweet spot is in terms of batch size with a single
    feeder thread. Somewhere between 1000 and 5000.
  • once I've got a sweet spot, increase threads to around 2 or 3 times the
    number of cores on the server (so max of 6 in my case).
  • run the backfill.

From what I've read in the on-line docs and newsgroups: Best load
performance is with more shards, and best query performance is with more
replicas.

I don't know how Lucene does this under the covers, but when I look under
the hood of other high-performance query engines I notice that as soon as
the underlying B+ trees go from 3 levels deep to 4, there is a sharp
decrease in update and query performance. So I'm guessing that a Lucene
index behaves somewhat similarly, and that by increasing the number of
shards the B-tree depths can be kept low.

So far I've spent a day trying to find that sweet spot. I'm taking a
subset of 100,000 and trying to index them in batches with varying batch
sizes. I have found that if I go above 1,000 the gains in total time start
to tail off. With a batch of 1,000 I was seeing an average indexing speed of
around 2 seconds. I then started running batches in parallel (I'm using
C#, so the Task Parallel Library) and I saw no improvement in speed... this
could be my shoddy coding though, I've yet to look into that. Anyway,
currently it's going to take about 26 hours for the full ingestion so I
need to spend more time figuring out that parallel bit. I'm so jealous of
your times!

Finally, I'm still wondering if our documents are quite complex and how
this can affect indexing speed. I've made a gist of an example doc that
we're indexing here https://gist.github.com/getsometoast/5056507 - if
anyone would care to just glance at it and let me know if they think it's
complex or not I'd greatly appreciate it.

I tend to keep things simple. For example, my documents are basically {
"field" : value, ... } where value can be a single value or it can be an
array of [ value, value, ... ]. In an array, values can be of mixed types.
For example:

"misc" : [ "A phrase of text", true, null, 45, -12.271 ]

Yes, you can toss in any document of any complexity. So this is my own
$0.02 US, and not necessarily a recommendation for anyone else.



Best load performance is with more shards, and best query performance is
with more replicas.

This is interesting - we use routing to make sure that all of a user's
documents are indexed in the same shard. So we're not just hitting the
bulk API with documents, we're also supplying a route for each document
based on the user ID. Does anyone know how this might impact the
performance of a batch load?
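For context, the route is set per document on the bulk request, roughly like
this (a sketch with the Java API - our real code is C#, and the index/type
names here are placeholders):

    // Sketch: per-document routing on a bulk request, so that everything for a
    // given user ends up on the shard its user ID hashes to.
    import org.elasticsearch.action.bulk.BulkRequestBuilder;
    import org.elasticsearch.client.Client;

    public class RoutedBulk {
        public static void addDoc(Client client, BulkRequestBuilder bulk,
                                  String userId, String docId, String json) {
            bulk.add(client.prepareIndex("myindex", "doc", docId)
                    .setRouting(userId)   // the hash of userId selects the shard
                    .setSource(json));
        }
    }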

Cheers,
James



Lucene is faster: for text indexing it uses a data structure that behaves
like a B-tree with the leaves always cached in memory. And there are no
classic B-trees in Elasticsearch, because they would be slow.

Look at Doug Cutting's post

http://mail-archives.apache.org/mod_mbox/lucene-java-user/200601.mbox/<43C6B934.3030500@apache.org>

Quoting: "B-Tree's are best for random, incremental updates. They
require log_b(N) disk accesses for inserts, deletes and accesses, where
b is the
number of entries per page, and N is the total number of entries in the
tree. But that's too slow for text indexing. Rather Lucene uses a
combination of file sorting and merging to update indexes, which is much
faster than a B-tree would be. For access, Lucene is equivalent to a
B-Tree with all but the leaves cached in memory, so that accesses
require only a single disk access. "
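To make the "sorting and merging" idea concrete, here is a toy illustration
(plain Java, nothing Lucene-specific): two already-sorted term lists, like
two segments, are merged into one sorted list in a single linear pass, with
no tree rebalancing anywhere.

    // Toy illustration of merge-based index maintenance: merging two sorted
    // term lists (think: two segments) in one linear pass.
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    public class MergeSortedTerms {
        static List<String> merge(List<String> a, List<String> b) {
            List<String> out = new ArrayList<String>(a.size() + b.size());
            int i = 0, j = 0;
            while (i < a.size() && j < b.size()) {
                out.add(a.get(i).compareTo(b.get(j)) <= 0 ? a.get(i++) : b.get(j++));
            }
            while (i < a.size()) out.add(a.get(i++));
            while (j < b.size()) out.add(b.get(j++));
            return out;
        }

        public static void main(String[] args) {
            System.out.println(merge(Arrays.asList("apple", "lucene", "shard"),
                                     Arrays.asList("bulk", "index", "merge")));
            // prints [apple, bulk, index, lucene, merge, shard]
        }
    }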

Jörg

On 01.03.13 02:29, InquiringMind wrote:

I don't know how Lucene does this under the covers, but when I look
under the hood of other high-performance query engines I notice that
as soon as the underlying B+ trees go from 3 levels deep to 4, there
is a sharp decrease in update and query performance. So I'm guessing
that a Lucene index behaves somewhat similarly, and that by increasing
the number of shards the B-tree depths can be kept low.


It's a matter of distribution of the data. Look at where your primary shards
are, since all indexing goes through the primary shards, and at how much
resource is available there (CPU cores, memory). Check how your routing
distributes documents over the shards; it depends on the routing parameter
and on whether a user's data volume varies much from the average user's
data volume. If the overall data distribution is even, there is no "bottleneck".
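One crude way to eyeball that distribution - a sketch with the Java search
API, where the index name and the list of user IDs are placeholders: a
match-all search routed to a user's shard returns the total number of docs
living on that shard.

    // Rough check of routed data distribution: for each routing value, a
    // match-all search restricted to that routing hits only the shard the value
    // maps to, so the total hit count is the number of docs on that shard.
    import org.elasticsearch.action.search.SearchResponse;
    import org.elasticsearch.client.Client;
    import org.elasticsearch.index.query.QueryBuilders;

    public class RoutingDistribution {
        public static void print(Client client, Iterable<String> userIds) {
            for (String userId : userIds) {
                SearchResponse response = client.prepareSearch("myindex")
                        .setQuery(QueryBuilders.matchAllQuery())
                        .setRouting(userId)
                        .setSize(0)
                        .execute().actionGet();
                System.out.println(userId + " -> "
                        + response.getHits().getTotalHits() + " docs on its shard");
            }
        }
    }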

Increasing shards has an upper limit: you can increase the number of
shards as long as your machines can handle the distributed indexing
load. If that limit is exceeded, just add machines - it's as easy as that.

Jörg

On 01.03.13 11:42, james.lewis@7digital.com wrote:

This is interesting - we use routing to make sure that all of a users
documents will be indexed in the same shard. So we're not just
hitting the bulk api with documents, we're also supplying a route for
each document based on the user id. Anyone know how this might have
an impact on the performance of a batch load?
