hi group,
I've been indexing about 35M documents and have bumped into some issues
which I would like to get feedback about.
I have already scoured the mailing list and picked up a few morsels of advice,
such as in:
Bulk indexing tips - elasticsearch | Google Groups
http://groups.google.com/group/elasticsearch/browse_thread/thread/00f5b144bacc39e5
Improve query time - elasticsearch | Google Groups
http://groups.google.com/group/elasticsearch/browse_thread/thread/c7b4448008a10acc
so, what have I done:
as I stream over the data to be indexed (read from mongodb in my case), my
initial approach was to use a blocking index call. that works fine, but is
the slowest method, as expected.
my second approach was to use the async index call (fire and forget), but
then I encountered NoNodeAvailableException, and that was that. I couldn't
find much information about it, and I suspect that I simply overwhelmed the
indexing thread pool. I imagine that some tuning as described here: Thread
Pool http://www.elasticsearch.org/guide/reference/modules/threadpool.html
could help, but it isn't clear to me how, so any advice here would be
appreciated. also, is there a Java API for it, or is it strictly
configuration that takes effect only at boot time (as in, no dynamic
reconfiguration)?
also, if it helps for posterity's sake, when I hit the
NoNodeAvailableException, the failure was catastrophic:
the cluster went into a red state and data was lost. granted, I used an index
with no replication. I actually had to shut down the cluster and restart
elasticsearch.
my third approach was to use bulk indexing via an async call, but I still
had to find a way to throttle. the hack I chose to employ was to keep a
counter of how many documents had been sent for indexing, and every so often
(I chose every batch-size) to check how many were "visible" in the index (I
used a count query for that). if it was under a certain bar (I chose
batch-size), I would take a nap and ask again, until the bar was met. so
essentially, I would lag up to a batch-size behind.
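to make the hack concrete, here is a minimal sketch of the counting logic in Java. it is not my actual code: CountThrottle, maxLag and napMillis are made-up names, and the LongSupplier stands in for whatever count query you run against the index. it assumes the "bar" means that the visible count may trail the sent count by at most maxLag documents.

```java
import java.util.function.LongSupplier;

// Hypothetical helper, not part of any Elasticsearch API.
final class CountThrottle {
    private final LongSupplier visibleCount; // stands in for a count query on the index
    private final long maxLag;               // how many docs the index may trail behind
    private final long napMillis;            // how long to sleep between polls
    private long sent = 0;                   // docs handed to the async bulk call so far

    CountThrottle(LongSupplier visibleCount, long maxLag, long napMillis) {
        this.visibleCount = visibleCount;
        this.maxLag = maxLag;
        this.napMillis = napMillis;
    }

    /**
     * Record that n more documents were submitted, then block until the
     * index trails by at most maxLag. Returns the number of naps taken.
     */
    int submitted(long n) {
        sent += n;
        int naps = 0;
        while (sent - visibleCount.getAsLong() > maxLag) {
            try {
                Thread.sleep(napMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                throw new IllegalStateException("interrupted while throttling", e);
            }
            naps++;
        }
        return naps;
    }
}
```

in the indexing loop this would be called once per batch, right after the async bulk request is fired, with maxLag set to the batch size. note that it never gives up: if documents are lost, the loop polls forever, so a real version would want a timeout.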
this works, but feels wrong. for example, I had to find the "right"
batch-size empirically. I would like to have some transparent throttling
built-in, or at the very least an api for asking about the "availability".
I'm not even sure I am making sense here, or that this is the right way to
think about it, so please enlighten me if you can.
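one common client-side way to get this kind of throttling, without polling the index, is to cap the number of in-flight bulk requests with a semaphore. the sketch below is only an illustration of that idea, not something taken from the threads above or from the elasticsearch client: InFlightLimiter and its sender function are hypothetical, and it assumes the real async bulk call has been wrapped so that it returns a CompletableFuture that completes when the response (or failure) arrives.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.function.Function;

// Hypothetical wrapper: limits how many async batches are pending at once.
final class InFlightLimiter<B> {
    private final Semaphore permits;
    // Assumption: wraps the real async bulk call and returns a future
    // that completes when the request finishes, successfully or not.
    private final Function<B, CompletableFuture<?>> sender;

    InFlightLimiter(int maxInFlight, Function<B, CompletableFuture<?>> sender) {
        this.permits = new Semaphore(maxInFlight);
        this.sender = sender;
    }

    /** Blocks while maxInFlight batches are pending, then sends the batch. */
    CompletableFuture<?> submit(B batch) {
        permits.acquireUninterruptibly();
        CompletableFuture<?> future;
        try {
            future = sender.apply(batch);
        } catch (RuntimeException e) {
            permits.release(); // the send never started, so give the permit back
            throw e;
        }
        // Release the permit when the request completes, whatever the outcome.
        return future.whenComplete((result, error) -> permits.release());
    }

    /** Number of additional batches that could be sent right now. */
    int available() {
        return permits.availablePermits();
    }
}
```

with this shape the producer thread slows down to the rate at which responses come back, so the only knob is maxInFlight rather than a batch-size-dependent lag.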
the fourth approach I plan to try is using a RabbitMQ river. my
understanding is that it transparently throttles and bulks messages as
needed. if anyone has taken that approach, I would love to hear about it.
thoughts?
thanks in advance,
Benny