I have a use case where I'd like to be able to dump all the documents in
ES to a specific output format. However, using scan or any other
"consistent" view is relatively slow. Using the scan query with a
"match_all", it processes items at around 80,000 per second, but at that
rate a full dump still takes over 5 hours. It also can't be parallelized
across machines, which effectively caps how far it can scale.
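For reference, the dump loop I'm running is essentially the sketch below. The client interface is a stand-in: the method names mirror elasticsearch-py's search/scroll style, but treat them as assumptions rather than the exact API.

```python
def dump_all(client, index, batch_size=1000):
    """Yield every document from `index` via a scroll cursor."""
    # Open the scroll with a match_all query and large batches.
    resp = client.search(index=index,
                         body={"query": {"match_all": {}}},
                         size=batch_size, scroll="5m")
    scroll_id = resp["_scroll_id"]
    hits = resp["hits"]["hits"]
    while hits:
        for hit in hits:
            yield hit["_source"]
        # Fetch the next page; an empty page means we're done.
        resp = client.scroll(scroll_id=scroll_id, scroll="5m")
        scroll_id = resp["_scroll_id"]
        hits = resp["hits"]["hits"]
```

The problem is that this loop is inherently serial: each scroll page must be fetched before the next, so adding machines doesn't help.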
I've also looked at things like Knapsack, Elastidump, etc., but these still
don't give me the ability to parallelize the work, and they're not
particularly fast. They also don't allow me to manipulate it to the
specific format I want (it's not JSON, and requires some organization of
the data).
So I have a few ideas, which may or may not be possible:
- Retrieve shard-specific data from Elasticsearch (i.e., "give me all the
data for shard X"). This would allow me to divide the task up into /at
least/ S tasks, where S is the number of shards, but there doesn't seem to
be an API that exposes this.
- Get snapshots of each shard from disk. This would also allow me to
divide up the work, but would require a framework on top to coordinate
which segments have been retrieved, etc.
- Hadoop. However, launching an entire MapReduce cluster just to dump data
sounds like overkill.
The first option gives me the most flexibility and would require the least
amount of work on my part, but there doesn't seem to be any way to dump all
the data for a specific shard via the API. Is there any sort of API or
flag that provides this, or otherwise provides a way to partition the data
to different consumers?
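For concreteness, the kind of per-consumer partitioning I mean could be faked client-side with a stable hash on the document ID (a sketch only; the downside is that every worker still has to scan every document, which is exactly why shard-level access would be better):

```python
import hashlib

def worker_owns(doc_id, worker_index, num_workers):
    """True if this document ID belongs to the given worker.

    Uses md5 rather than Python's built-in hash(), which is salted
    per-process and therefore not stable across machines.
    """
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_workers == worker_index
```

Each of W workers runs the same scan and keeps only the IDs it owns, so every document lands on exactly one worker.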
The second would also (presumably) give me the ability to subdivide tasks
per worker, and would allow the work to be done offline. I was able to
write a sample program that uses Lucene to do this, but it adds the
complexity of coordinating work across the various hosts in the cluster,
as well as an intermediate step where I transfer the files to another host
to combine them. That isn't a terrible problem to have, but it does
require additional infrastructure to organize.
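The coordination itself is mostly bookkeeping. Assuming each worker can see a list of shard-copy directories (the names and paths below are hypothetical placeholders), the assignment step could be as simple as:

```python
def assign_shards(shard_paths, workers):
    """Round-robin a list of shard directories across worker hosts.

    shard_paths: e.g. ["node1:/data/index/0", ...]  (placeholder paths)
    workers: worker host names
    Returns {worker: [paths]} with every shard assigned exactly once.
    """
    plan = {w: [] for w in workers}
    for i, path in enumerate(sorted(shard_paths)):
        plan[workers[i % len(workers)]].append(path)
    return plan
```

Anything fancier (retries, tracking which shards have been processed) would sit on top of a mapping like this.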
The third is not desirable because it adds a large amount of operational
load without a clear payoff, since we don't already have a MapReduce
cluster on hand.
Thanks for any tips or suggestions!
Andrew