Index creation from a very large data set

I see, thanks for sharing. If the data is stored in Hadoop, I guess map reduce
jobs similar to the samples Katta provides can be used with elasticsearch. I am
not too familiar with riak, but the same logic should apply.

-shay.banon

On Fri, Mar 26, 2010 at 5:07 PM, Colin Surprenant <colin.surprenant@gmail.com> wrote:

In our hadoop+katta prototype, the data is simply in HDFS and we built an
index-creation job based on the katta hadoop indexing job samples.

What I am currently looking at is using riak http://riak.basho.com/ for data
storage and its mapreduce framework
https://wiki.basho.com/display/RIAK/MapReduce
to launch elasticsearch index creation jobs. I am trying to evaluate the most
efficient way to parallelize index creation when building a new index over the
complete data set, and what the best integration point between riak mapreduce
and elasticsearch would be.

Colin
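The parallelization Colin is evaluating can be illustrated with a small sketch.
Riak's mapreduce functions were written in JavaScript or Erlang, so this Python
version only shows the partitioning idea; `index_document` is a hypothetical
callback standing in for whatever actually pushes a record to elasticsearch:

```python
# Sketch: parallelize index creation by partitioning the keyspace across
# workers. In the riak setup the per-key work would run inside riak map
# functions instead of local threads; this only illustrates the shape.
from concurrent.futures import ThreadPoolExecutor


def partition(keys, workers):
    """Split the key set into roughly equal chunks, one per worker."""
    return [keys[i::workers] for i in range(workers)]


def index_partition(keys, index_document):
    """Index every record in one partition; returns the count indexed."""
    for key in keys:
        index_document(key)  # e.g. fetch the record and PUT it to elasticsearch
    return len(keys)


def parallel_index(keys, index_document, workers=4):
    """Run the partitions concurrently and return the total documents indexed."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        counts = pool.map(
            lambda part: index_partition(part, index_document),
            partition(keys, workers),
        )
        return sum(counts)
```

Each partition is independent, so the indexing throughput scales with the
number of workers until the elasticsearch cluster itself becomes the bottleneck.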

On Mar 26, 6:14 am, Shay Banon shay.ba...@elasticsearch.com wrote:

Not really sure I understood then. Where do you store your data that you
plan to run your map reduce jobs on? Hadoop for Lucene+Katta is not where
you store your data...

-shay.banon

On Fri, Mar 26, 2010 at 7:23 AM, Colin Surprenant <colin.surpren...@gmail.com> wrote:

Well, we do have a hadoop prototype but with lucene and katta. I am
currently looking into riak.

Colin

On Mar 25, 6:36 pm, Shay Banon shay.ba...@elasticsearch.com wrote:

Sounds perfect. Do you use Hadoop? If so, simply use the native Java APIs
elasticsearch comes with and call the index API as part of your jobs.

-shay.banon
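Shay's suggestion can be sketched roughly as follows. For brevity this uses
elasticsearch's HTTP index API (PUT to /index/type/id) rather than the native
Java client he mentions; the endpoint, index name, and the shape of the task
are all assumptions for illustration, not part of the thread:

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumed elasticsearch endpoint


def build_index_request(index, doc_type, doc_id, doc):
    """Build a PUT request for elasticsearch's index API (/index/type/id)."""
    body = json.dumps(doc).encode("utf-8")
    return urllib.request.Request(
        f"{ES_URL}/{index}/{doc_type}/{doc_id}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def reduce_task(records):
    """Hypothetical Hadoop-style reduce task: index each transformed record."""
    for doc_id, doc in records:
        req = build_index_request("articles", "article", str(doc_id), doc)
        # urllib.request.urlopen(req)  # enable against a live cluster
```

Because each map/reduce task indexes its own slice of the data, the index is
built in parallel instead of walking the dataset document-by-document.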

On Fri, Mar 26, 2010 at 1:08 AM, Colin Surprenant <colin.surpren...@gmail.com> wrote:

I mean leveraging a mapreduce framework over a distributed datastore to
speed up the index creation.

I have a very large dataset over which we run mapreduce tasks to analyse
and/or transform the data and for which we might want to create new indexes
for this new transformed/extracted data. The most efficient solution would be
to create the indexes as part of the mapreduce tasks to distribute the process.

Colin

On Mar 25, 5:15 pm, the claw nick.minute...@gmail.com wrote:

Do you mean using Elastic Search to distribute the process of
loading/transforming the data?
Or just the distribution of the indexing (which ES does).

-Nick

On Mar 25, 7:37 pm, Colin Surprenant <colin.surpren...@gmail.com> wrote:

Hi,

What are the options for creating a new index from an existing very large
data set? Do we need to linearly walk the data and insert each document
one-by-one?

Otherwise, given a distributed datastore with mapreduce support, would it
be possible to leverage such a framework to distribute the ES index
creation by launching mapreduce functions to, for example, compute some
new information over our existing data and create a new index from it?

Thanks for your help,
Colin