Index creation from a very large data set

I see, thanks for sharing. If the data is stored in Hadoop, I guess map reduce
jobs similar to the samples Katta provides can be used with elasticsearch. I am
not too familiar with riak, but the same logic should apply.

-shay.banon

On Fri, Mar 26, 2010 at 5:07 PM, Colin Surprenant <colin.surprenant@gmail.com> wrote:

In our hadoop+katta prototype, the data is simply in HDFS and we built an
index-creation job based on the katta hadoop indexing job samples.

What I am currently looking at is using riak http://riak.basho.com/ for data
storage and its mapreduce framework
https://wiki.basho.com/display/RIAK/MapReduce
to launch elasticsearch index creation jobs. I am trying to evaluate the most
efficient way to parallelize index creation when building a new index over the
complete data set, and what the best integration point between riak mapreduce
and elasticsearch would be.

Colin
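The parallelization Colin is evaluating can be illustrated with a small sketch.
Riak's mapreduce functions were written in JavaScript or Erlang, so this Python
version only shows the partitioning idea; `index_document` is a hypothetical
callback standing in for whatever actually pushes a record to elasticsearch:

```python
# Sketch: parallelize index creation by partitioning the keyspace across
# workers. In the riak setup the per-key work would run inside riak map
# functions instead of local threads; this only illustrates the shape.
from concurrent.futures import ThreadPoolExecutor


def partition(keys, workers):
    """Split the key set into roughly equal chunks, one per worker."""
    return [keys[i::workers] for i in range(workers)]


def index_partition(keys, index_document):
    """Index every record in one partition; returns the count indexed."""
    for key in keys:
        index_document(key)  # e.g. fetch the record and PUT it to elasticsearch
    return len(keys)


def parallel_index(keys, index_document, workers=4):
    """Run the partitions concurrently and return the total documents indexed."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        counts = pool.map(
            lambda part: index_partition(part, index_document),
            partition(keys, workers),
        )
        return sum(counts)
```

Each partition is independent, so the indexing throughput scales with the
number of workers until the elasticsearch cluster itself becomes the bottleneck.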

On Mar 26, 6:14 am, Shay Banon shay.ba...@elasticsearch.com wrote:

Not really sure I understood then. Where do you store your data that you
plan to run your map reduce jobs on? Hadoop for Lucene+Katta is not where
you store your data...

-shay.banon

On Fri, Mar 26, 2010 at 7:23 AM, Colin Surprenant <colin.surpren...@gmail.com> wrote:

Well, we do have a hadoop prototype but with lucene and katta. I am
currently looking into riak.

Colin

On Mar 25, 6:36 pm, Shay Banon shay.ba...@elasticsearch.com wrote:

Sounds perfect. Do you use Hadoop? If so, simply use the native Java APIs
elasticsearch comes with and call the index API as part of your jobs.

-shay.banon
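Shay's suggestion can be sketched roughly as follows. For brevity this uses
elasticsearch's HTTP index API (PUT to /index/type/id) rather than the native
Java client he mentions; the endpoint, index name, and the shape of the task
are all assumptions for illustration, not part of the thread:

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumed elasticsearch endpoint


def build_index_request(index, doc_type, doc_id, doc):
    """Build a PUT request for elasticsearch's index API (/index/type/id)."""
    body = json.dumps(doc).encode("utf-8")
    return urllib.request.Request(
        f"{ES_URL}/{index}/{doc_type}/{doc_id}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def reduce_task(records):
    """Hypothetical Hadoop-style reduce task: index each transformed record."""
    for doc_id, doc in records:
        req = build_index_request("articles", "article", str(doc_id), doc)
        # urllib.request.urlopen(req)  # enable against a live cluster
```

Because each map/reduce task indexes its own slice of the data, the index is
built in parallel instead of walking the dataset document-by-document.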

On Fri, Mar 26, 2010 at 1:08 AM, Colin Surprenant <colin.surpren...@gmail.com> wrote:

I mean leveraging a mapreduce framework over a distributed datastore to
speed up the index creation.

I have a very large dataset over which we run mapreduce tasks to analyse
and/or transform the data and for which we might want to create new indexes
for this new transformed/extracted data. The most efficient solution would be
to create the indexes as part of the mapreduce tasks to distribute the process.

Colin

On Mar 25, 5:15 pm, the claw nick.minute...@gmail.com wrote:

Do you mean using Elastic Search to distribute the process of
loading/transforming the data?
Or just the distribution of the indexing (which ES does).

-Nick

On Mar 25, 7:37 pm, Colin Surprenant <colin.surpren...@gmail.com> wrote:

Hi,

What are the options for creating a new index from an existing very large
data set? Do we need to linearly walk the data and insert each document
one-by-one?

Otherwise, given a distributed datastore with mapreduce support, would it
be possible to leverage such a framework to distribute the ES index
creation by launching mapreduce functions to, for example, compute some
new information over our existing data and create a new index from it?

Thanks for your help,
Colin