I see, thanks for sharing. If the data is stored in Hadoop, I guess map
reduce jobs similar to the katta samples can be used with elasticsearch. I
am not too familiar with riak, but the same logic should apply.
-shay.banon
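For the Hadoop route being discussed, indexing from inside a job can also go through elasticsearch's HTTP index API (PUT /{index}/{type}/{id}) instead of the Java client, which keeps the task code dependency-free. A rough sketch, assuming a node at localhost:9200; the index/type names, the `EsIndexSketch` class, and the hand-built JSON are all illustrative, not part of any API:

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Sketch: index one document into elasticsearch over its REST index API.
// In a map reduce task, indexDocument() would be called once per record.
public class EsIndexSketch {

    // Build the JSON source for a document; a real job would use a JSON
    // library and pull the fields from the framework's input record.
    static String buildSource(String title, String body) {
        return "{\"title\":\"" + title + "\",\"body\":\"" + body + "\"}";
    }

    // PUT the document to /{index}/{type}/{id} on an elasticsearch node.
    static int indexDocument(String baseUrl, String index, String type,
                             String id, String source) throws Exception {
        URL url = new URL(baseUrl + "/" + index + "/" + type + "/" + id);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("PUT");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/json");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(source.getBytes(StandardCharsets.UTF_8));
        }
        return conn.getResponseCode(); // 200/201 on success
    }

    public static void main(String[] args) throws Exception {
        String source = buildSource("hello", "world");
        System.out.println(source);
        // Against a live node, a task would then call, per record:
        // indexDocument("http://localhost:9200", "articles", "article", "1", source);
    }
}
```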
On Fri, Mar 26, 2010 at 5:07 PM, Colin Surprenant <
colin.surprenant@gmail.com> wrote:
In our hadoop+katta prototype, the data is simply in hdfs and we made
an index creation job based on the katta hadoop indexing job samples.

What I am currently looking at is to use riak http://riak.basho.com/
for data storage and their mapreduce framework
https://wiki.basho.com/display/RIAK/MapReduce
to launch elasticsearch index creation jobs. I am trying to evaluate what
will be the most efficient way to parallelize the index creation when
creating a new index over the complete data, and what would be the best
integration point between riak mapreduce and elasticsearch.

Colin
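The parallelization question above can be modeled independently of riak or hadoop: partition the documents and run one indexing task per partition. A toy sketch of that shape, where the actual elasticsearch index call is stubbed out as a counter (the `ParallelIndexSketch` class and method names are illustrative, not any real API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Toy model of distributed index creation: split the dataset into
// partitions (one per map task) and index each partition in parallel.
public class ParallelIndexSketch {
    static final AtomicInteger indexed = new AtomicInteger();

    // Stand-in for a call to the elasticsearch index API.
    static void indexDocument(String doc) {
        indexed.incrementAndGet();
    }

    // Index all docs using the given number of parallel partitions;
    // returns how many documents were indexed.
    static int indexAll(List<String> docs, int partitions) {
        indexed.set(0); // reset the counter for this run
        ExecutorService pool = Executors.newFixedThreadPool(partitions);
        int chunk = (docs.size() + partitions - 1) / partitions; // ceil division
        for (int p = 0; p < partitions; p++) {
            int from = Math.min(docs.size(), p * chunk);
            int to = Math.min(docs.size(), from + chunk);
            List<String> slice = docs.subList(from, to); // disjoint slices
            pool.submit(() -> slice.forEach(ParallelIndexSketch::indexDocument));
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return indexed.get();
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<>();
        for (int i = 0; i < 100; i++) docs.add("doc-" + i);
        System.out.println(indexAll(docs, 4)); // prints 100
    }
}
```

In a real riak or hadoop setup the framework, not a local thread pool, would own the partitioning; the open question from the thread is exactly where the index call plugs into that framework.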
On Mar 26, 6:14 am, Shay Banon <shay.ba...@elasticsearch.com> wrote:
Not really sure I understood then. Where do you store your data that you
plan to run your map reduce jobs on? Hadoop for Lucene+Katta is not where
you store your data...

-shay.banon
On Fri, Mar 26, 2010 at 7:23 AM, Colin Surprenant <
colin.surpren...@gmail.com> wrote:
Well, we do have a hadoop prototype but with lucene and katta. I am
currently looking into riak.

Colin
On Mar 25, 6:36 pm, Shay Banon <shay.ba...@elasticsearch.com> wrote:
Sounds perfect. Do you use Hadoop? If so, simply use the native Java APIs
elasticsearch comes with and call the index API as part of your jobs.

-shay.banon
On Fri, Mar 26, 2010 at 1:08 AM, Colin Surprenant <
colin.surpren...@gmail.com> wrote:
I mean leveraging a mapreduce framework over a distributed datastore
to speed up the index creation.

I have a very large dataset over which we run mapreduce tasks to
analyse and/or transform the data, and for which we might want to
create new indexes for this new transformed/extracted data. The most
efficient solution would be to create the indexes as part of the
mapreduce tasks to distribute the process.

Colin
On Mar 25, 5:15 pm, the claw <nick.minute...@gmail.com> wrote:
Do you mean using Elastic Search to distribute the process of
loading/transforming the data? Or just the distribution of the
indexing (which ES does)?

-Nick
On Mar 25, 7:37 pm, Colin Surprenant <colin.surpren...@gmail.com> wrote:
Hi,
What are the options for creating a new index from an existing very
large data set? Do we need to linearly walk the data and insert each
document one-by-one?

Otherwise, given a distributed datastore with mapreduce support, would
it be possible to leverage such a framework to distribute the ES index
creation by launching mapreduce functions to, for example, compute
some new information over our existing data and create a new index
from it?

Thanks for your help,
Colin