Hi guys, I've got an issue with bulk indexing that I need a hand
diagnosing. I have a feeling I could just be hammering the cluster too hard,
but in case it's something I've set up wrong I've included as much
information as possible.
We have two nodes running elasticsearch v0.20.4, with 1 index split across
100 shards and 1 replica of the index (we've over-allocated and we use
routing / aliases when indexing / searching). This is the
elasticsearch.yml config on both of our nodes:
https://gist.github.com/getsometoast/5047292
and this is the logging.yml for both nodes:
https://gist.github.com/getsometoast/5047308
Both nodes are running Oracle Java version 1.7.0_13 and both have 32GB RAM
with 24GB allocated to the JVM. Here's a gist of the Java environment
variables for the process:
https://gist.github.com/getsometoast/5047322
Last night I ran a backfill into our production cluster. I tried to index
50 million documents (avg doc size ~2kb) via the bulk API in chunks of
100,000. I pointed my elasticsearch client at one of the nodes in the
cluster and left the backfill process running overnight. When I came in
to inspect it this morning it had run for approx. 11hrs, indexed ~3/4 of
the 50 million docs and then hung. I had a look at my backfill process
logs and it had hung while sending a request to the bulk API.
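For a sense of scale, the figures above can be turned into a rough request
size. This is only back-of-the-envelope arithmetic, assuming the ~2kb
average holds and ignoring the bulk action lines and any JSON overhead:

```python
# Rough sizing of the backfill described above (figures taken from the post).
total_docs = 50_000_000
docs_per_bulk = 100_000
avg_doc_bytes = 2 * 1024  # "~2kb" average document size; an approximation

bulk_requests = total_docs // docs_per_bulk
bytes_per_bulk = docs_per_bulk * avg_doc_bytes

print(bulk_requests)                   # 500 bulk requests in total
print(bytes_per_bulk / (1024 * 1024))  # 195.3125 MiB per bulk request
```

So each bulk request body is on the order of 200MB before overhead, which
may well be part of the "hammering the cluster too hard" theory.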
I then inspected the logs on my elasticsearch nodes (I'm having a lot of
issues with elasticsearch logging but that's another topic). The log for
the node that I was sending my bulk index requests to has nothing in it for
the majority of the time the backfill was running (everything is set
to debug in the yml, so this confuses me). However, it does have some
entries at about 5pm yesterday evening, which is shortly after I kicked off
the process, and then some again this morning, long after the backfill
process started hanging. Here's a gist of the output:
https://gist.github.com/getsometoast/5047202
The other node has nothing in its logs.
The other node, the one I wasn't sending the bulk index request to, is
currently using 77% of the memory allocated to it, way higher than the
other node. Here's the output from paramedic:
Just for completeness, here's how I run a backfill:
- set the index refresh to -1 and merge policy factor to 30
- read all the data I need into memory
- denormalize the data in 100,000 object batches
- create a bulk request for the batch
- send the bulk request to one of my elasticsearch nodes (hard coded
address - last night I used the current master node in the cluster)
- when finished processing all data, set the index refresh to 1s and
merge factor to 10
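The batching and sending steps above could be sketched roughly as follows,
with a smaller batch size and a client-side socket timeout so that a stalled
bulk request raises an error instead of hanging. This is not my actual
backfill code: it is a minimal Python sketch using only the standard
library, and `base_url`, `index`, `doc_type`, the `"id"` field on each
document, the batch size of 5,000 and the 60 second timeout are all
hypothetical values chosen for illustration.

```python
import json
import urllib.request
from itertools import islice


def chunked(items, size):
    """Yield successive lists of at most `size` items from any iterable."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def build_bulk_body(docs, index, doc_type):
    """Build a newline-delimited bulk body: an action line, then a source line, per doc."""
    lines = []
    for doc in docs:
        # Assumes every doc carries its own "id" field (hypothetical).
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc["id"]}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    # The bulk API expects the body to end with a newline.
    return ("\n".join(lines) + "\n").encode("utf-8")


def send_bulk(base_url, body, timeout_seconds=60):
    """POST one bulk body; the timeout makes a stalled socket raise rather than block forever."""
    request = urllib.request.Request(base_url + "/_bulk", data=body, method="POST")
    with urllib.request.urlopen(request, timeout=timeout_seconds) as response:
        return json.loads(response.read().decode("utf-8"))


def failed_items(bulk_response):
    """Return the per-item results that carry an "error" key."""
    failed = []
    for item in bulk_response.get("items", []):
        for result in item.values():
            if "error" in result:
                failed.append(result)
    return failed


def backfill(docs, base_url, index, doc_type, batch_size=5000):
    """Send docs in batches, stopping at the first batch that reports item errors."""
    for batch in chunked(docs, batch_size):
        response = send_bulk(base_url, build_bulk_body(batch, index, doc_type))
        errors = failed_items(response)
        if errors:
            raise RuntimeError("%d items failed in this batch" % len(errors))
```

Calling something like `backfill(all_docs, "http://localhost:9200", "my_index", "my_type")`
(all placeholder values) would then fail fast on a timeout or on a batch
with rejected items, which at least gives the backfill logs something to
show when the cluster stops responding.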
Sorry if all that was a bit long-winded - I just wanted to get everything I
know about the state of the system down to see if anyone could spot
anything weird I might be doing. My question is: why would a request to
the bulk API hang and never time out, and when this happens, why would I
not see any clear sign of an error on the cluster?
Any help, questions etc greatly appreciated as this is starting to block