Seeking advice on massive indexing

Hi group,

I've been indexing about 35M documents and have run into some issues I would
like feedback on. I have already scoured the mailing list and picked up a few
morsels of advice, such as:

Bulk indexing tips - elasticsearch | Google Groups
http://groups.google.com/group/elasticsearch/browse_thread/thread/00f5b144bacc39e5

Improve query time - elasticsearch | Google Groups
http://groups.google.com/group/elasticsearch/browse_thread/thread/c7b4448008a10acc

So, here's what I've done:

As I stream over the data to be indexed (read from MongoDB in my case), my
initial approach was to use a blocking index call. That works fine, but it is
the slowest method, as expected.
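For concreteness, this is roughly what I mean, as a minimal sketch with the
Java client (index, type, and field names are placeholders):

import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;

import org.elasticsearch.action.index.IndexResponse;
import org.elasticsearch.client.Client;

public class BlockingIndexer {
    // one synchronous round trip per document: simple, but the slowest option
    public static void indexOne(Client client, String id, String name) throws Exception {
        IndexResponse response = client.prepareIndex("myindex", "mytype", id)
                .setSource(jsonBuilder().startObject().field("name", name).endObject())
                .execute()
                .actionGet(); // blocks until the cluster acknowledges the document
    }
}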

My second approach was to use the async index call (fire and forget), but
then I encountered NoNodeAvailableException, and that was that. I couldn't
find much information about it, and I suspect that I simply overwhelmed the
indexing thread pool. I imagine that some tailoring as described in the
Thread Pool module docs
(http://www.elasticsearch.org/guide/reference/modules/threadpool.html) could
help, but it wasn't clear to me how, so any advice here would be appreciated.
Also, is there a Java API for this, or is it strictly configuration that only
takes effect at boot time (i.e. no dynamic reconfiguration)?
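For the record, my best guess from that page is node-level settings along
these lines in elasticsearch.yml; the keys and the fixed pool type are my
reading of the docs, and the sizes are made up, so please correct me:

# elasticsearch.yml -- node settings, which I believe are read at startup
threadpool:
    index:
        type: fixed
        size: 30          # threads handling index/delete operations
        queue_size: 200   # pending operations held before rejection
    bulk:
        type: fixed
        size: 10
        queue_size: 50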
Also, if it helps for posterity's sake: when I hit the
NoNodeAvailableException, the failure was catastrophic. The cluster went into
a red state and data was lost (granted, I used an index with no replication).
I actually had to shut down the cluster and restart elasticsearch.

My third approach was to use bulk indexing via an async call, but I still had
to find a way to throttle. The hack I chose was to keep a counter of how many
documents had been sent for indexing, and every so often (every batch-size,
in my case) to check how many were "visible" in the index (I used a count
query for that). If the visible count was below a certain bar (again, a
batch-size), I would take a nap and ask again until the bar was met. So
essentially, I would lag up to one batch-size behind.
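An alternative I considered (but have not tried) is to bound the number of
in-flight async bulk requests with a semaphore instead of polling a count
query; a rough sketch, where maxInFlight would still have to be picked
empirically:

import java.util.concurrent.Semaphore;

import org.elasticsearch.action.ActionListener;
import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.action.bulk.BulkResponse;

public class ThrottledBulkIndexer {
    private final Semaphore inFlight;

    public ThrottledBulkIndexer(int maxInFlight) {
        this.inFlight = new Semaphore(maxInFlight);
    }

    public void submit(BulkRequestBuilder bulk) throws InterruptedException {
        inFlight.acquire(); // blocks once maxInFlight batches are outstanding
        bulk.execute(new ActionListener<BulkResponse>() {
            public void onResponse(BulkResponse response) {
                inFlight.release();
                if (response.hasFailures()) {
                    // inspect/retry the failed items here
                }
            }

            public void onFailure(Throwable t) {
                inFlight.release();
                // log and back off; this is where I saw NoNodeAvailableException
            }
        });
    }
}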

The counting hack works, but it feels wrong. For example, I had to find the
"right" batch-size empirically. I would like some transparent throttling
built in, or at the very least an API for asking about "availability". I'm
not even sure I am making sense here, or that this is the right way to think
about it, so please enlighten me if you can.

The fourth approach I plan to try is a RabbitMQ river. My understanding is
that it transparently throttles and bulks messages as needed. If anyone has
taken that approach, I would love to hear about it.
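If I understand the plugin correctly, the river is registered by indexing a
_meta document into the _river index, along these lines; the rabbitmq and
index settings shown are my guesses from the plugin README, so treat them as
assumptions:

import org.elasticsearch.client.Client;

public class RiverSetup {
    public static void createRabbitRiver(Client client) {
        String meta = "{"
                + "\"type\" : \"rabbitmq\","
                + "\"rabbitmq\" : {"
                + "  \"host\" : \"localhost\", \"port\" : 5672,"
                + "  \"user\" : \"guest\", \"pass\" : \"guest\","
                + "  \"queue\" : \"elasticsearch\""
                + "},"
                + "\"index\" : {"
                + "  \"bulk_size\" : 100,"        // docs per bulk request (my guess)
                + "  \"bulk_timeout\" : \"10ms\"" // flush a partial batch after this long
                + "}"
                + "}";
        // "my_rabbit_river" is a made-up river name
        client.prepareIndex("_river", "my_rabbit_river", "_meta")
                .setSource(meta)
                .execute()
                .actionGet();
    }
}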

Thoughts?

Thanks in advance,
Benny

Hi,

You can overwhelm the cluster if you send too many concurrent bulk requests.
There is an option to configure the thread pool to disallow this, by
configuring the number of bulk indexing threads. You do still need to do some
work yourself to make sure you won't overwhelm the cluster. You can send sync
bulk indexing requests, or async ones (or blocking ones from different
threads); just make sure you don't have too many of those in flight at once.
How many? It really depends on your cluster and machine specs.
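For example, a bare-bones sketch of the "blocking ones from different
threads" option, where a fixed worker pool caps how many bulk requests hit
the cluster at once (index/type names and the shape of the batches are just
placeholders):

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.client.Client;

public class BoundedBulkWorkers {
    public static void run(final Client client, List<List<String>> batches, int poolSize)
            throws InterruptedException {
        // at most poolSize bulk requests are ever in flight
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        for (final List<String> batch : batches) {
            pool.submit(new Runnable() {
                public void run() {
                    BulkRequestBuilder bulk = client.prepareBulk();
                    for (String json : batch) {
                        bulk.add(client.prepareIndex("myindex", "mytype").setSource(json));
                    }
                    BulkResponse response = bulk.execute().actionGet(); // sync call
                    if (response.hasFailures()) {
                        // retry or log the failed items
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS); // wait for outstanding batches
    }
}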

I know, it would be really nice to do automatic throttling on the
cluster end, but you (the user) need to do some of the work currently...

-shay.banon
