Elasticsearch is a distributed search engine: not only is search distributed,
but indexing as well. The more nodes you have, the better indexing throughput
you will get (assuming you index from multiple threads / multiple clients).
One option, as you suggested, is to walk the data linearly and index it. Even
with a single process you can, of course, parallelize the indexing itself
(push the indexing work into a thread pool), and if you can parallelize the
fetching of the data as well, do that too.
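To make the thread pool idea concrete, here is a minimal sketch. The
`index_doc` and `fetch_docs` functions are hypothetical stand-ins: in a real
setup `index_doc` would call the elasticsearch index API over HTTP, and
`fetch_docs` would walk your actual data store.

```python
from concurrent.futures import ThreadPoolExecutor

def index_doc(doc):
    # Hypothetical stand-in for a call to the elasticsearch index API;
    # a real version would issue an HTTP request to the cluster.
    return {"_id": doc["id"], "ok": True}

def fetch_docs():
    # Stand-in for linearly walking the existing data set.
    for i in range(100):
        yield {"id": i, "body": "document %d" % i}

# Push the indexing work into a thread pool so that fetching and
# indexing overlap instead of running strictly one document at a time.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(index_doc, fetch_docs()))

print(len(results))  # one result per indexed document
```

The same shape works with multiple client processes instead of threads; the
cluster does not care where the index requests come from.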
An option to do a map/reduce over your data store is also certainly possible.
Just fork jobs; each job fetches (and possibly massages) the data, and
indexes it into elasticsearch.
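The fork-jobs idea can be sketched like this. Everything here is a stand-in:
`run_job` fakes the fetch/massage steps, and a real job would index its
documents into elasticsearch instead of just counting them. A thread-backed
pool is used only to keep the sketch portable and self-contained.

```python
from multiprocessing.dummy import Pool  # thread-backed Pool, keeps the sketch runnable anywhere

def run_job(partition_id):
    # Each forked job fetches its slice of the data store (stubbed here),
    # optionally massages it, and would then index it into elasticsearch.
    docs = [{"id": partition_id * 10 + i} for i in range(10)]      # stand-in fetch
    massaged = [dict(d, tag="partition-%d" % partition_id) for d in docs]  # stand-in massage
    return len(massaged)  # a real job would return how many documents it indexed

# "Map" phase: run the jobs in parallel over the partitions.
with Pool(4) as pool:
    indexed_counts = pool.map(run_job, range(8))

# "Reduce" phase: here, just total up what was indexed.
print(sum(indexed_counts))  # → 80
```

In a real map/reduce framework the partitions would come from the data
store's own splits, and the reduce step could verify counts against the
index.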
In general, the more parallelism you get into your indexing process, the
better. You can still use a single elasticsearch cluster with the index API,
thanks to the fact that elasticsearch is distributed and highly concurrent
(even on a single node).
Some notes to increase your indexing speed with elasticsearch:
- If you know in advance that the documents you index do not already exist
in the index, use the create opType with the index API.
- If you don't need the default 1 second near real time refresh, increase
the refresh interval (described here:
http://www.elasticsearch.com/docs/elasticsearch/index_modules/engine/robin/#Refresh
).
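Both notes boil down to two REST requests. The sketch below only composes
them (host, index name, and type name are hypothetical, and nothing is sent,
since it assumes a running cluster); the exact settings key for the refresh
interval may also differ between versions, so check the docs linked above.

```python
import json

# Hypothetical cluster address and index/type names.
host, index = "http://localhost:9200", "myindex"

# 1. opType create: the request fails instead of overwriting if the id
#    already exists, so elasticsearch can skip the existence check it
#    would otherwise do on a plain index call.
create_url = "%s/%s/doc/42?op_type=create" % (host, index)
create_body = json.dumps({"title": "hello"})

# 2. Raise the refresh interval during a bulk load when sub-second
#    near real time search is not needed.
settings_url = "%s/%s/_settings" % (host, index)
settings_body = json.dumps({"index": {"refresh_interval": "30s"}})

print(create_url)
print(settings_body)
```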
-shay.banon
On Thu, Mar 25, 2010 at 9:37 PM, Colin Surprenant <
colin.surprenant@gmail.com> wrote:
Hi,
What are the options for creating a new index from an existing very
large data set? Do we need to linearly walk the data and insert each
document one-by-one?
Otherwise, given a distributed datastore with mapreduce support, would
it be possible to leverage such a framework to distribute the ES index
creation by launching mapreduce functions to, for example, compute
some new information over our existing data and create a new index
from it?
Thanks for your help,
Colin