Index creation from very large data set


(Colin Surprenant) #1

Hi,

What are the options for creating a new index form an existing very
large data set? do we need to linearly walk the data and insert each
document one-by-one?

Otherwise, given a distributed datastore with mapreduce support, would
it be possible to leverage such a framework to distribute the ES index
creation by launching mapreduce functions to, for example, compute
some new information over our existing data and create a new index
from it??

Thanks for your help,
Colin


(Shay Banon) #2

ElasticSearch is a distributed search engine. So search is distributed, but
also indexing. The more nodes you have, the better indexing throughput you
will get (assuming you index from multiple threads / multiple clients).

One option, as you suggested, is to linearly walk the data you have, and
index it. Even with a single process, you can, of course, parallelize the
indexing process (stuff indexing into a thread pool). If you can parallelize
your fetching of the data, you can do it as well.

An option to do a map/reduce over you data store is also certainly possible.
Just fork jobs, each job fetches (and possibly massages) the data, and index
it into elasticsearch.

In general, the more parallelism you get with your indexing process, the
better. You still use a single elasticsearch cluster with the index API
thanks to the fact that elasticsearch is distributed and highly concurrent
(on a single node).

Some notes to increase your indexing speed with elasticsearch:

  1. If you know in advance that the documents you index do not already exists
    in it, use the opType create with the index API.
  2. If you don't need near real time search of 1 second, increase it
    (described here:
    http://www.elasticsearch.com/docs/elasticsearch/index_modules/engine/robin/#Refresh
    ).

-shay.banon

On Thu, Mar 25, 2010 at 9:37 PM, Colin Surprenant <
colin.surprenant@gmail.com> wrote:

Hi,

What are the options for creating a new index form an existing very
large data set? do we need to linearly walk the data and insert each
document one-by-one?

Otherwise, given a distributed datastore with mapreduce support, would
it be possible to leverage such a framework to distribute the ES index
creation by launching mapreduce functions to, for example, compute
some new information over our existing data and create a new index
from it??

Thanks for your help,
Colin


#3

Do you mean using Elastic Search to distribute the process of loading/
transforming the data?
Or just the distribution of the indexing (which ES does).

-Nick

On Mar 25, 7:37 pm, Colin Surprenant colin.surpren...@gmail.com
wrote:

Hi,

What are the options for creating a new index form an existing very
large data set? do we need to linearly walk the data and insert each
document one-by-one?

Otherwise, given a distributed datastore with mapreduce support, would
it be possible to leverage such a framework to distribute the ES index
creation by launching mapreduce functions to, for example, compute
some new information over our existing data and create a new index
from it??

Thanks for your help,
Colin


(Colin Surprenant) #4

I mean leveraging a mapreduce framework over a distributed datastore
to speedup the index creation.

I have a very large dataset over which we run mapreduce tasks to
analyse and/or transform the data and for which we might want to
create new indexes for this new transformed/extracted data. The most
efficient solution would be to create the indexes as part of the
mapreduce tasks to distribute the process.

Colin

On Mar 25, 5:15 pm, the claw nick.minute...@gmail.com wrote:

Do you mean using Elastic Search to distribute the process of loading/
transforming the data?
Or just the distribution of the indexing (which ES does).

-Nick

On Mar 25, 7:37 pm, Colin Surprenant colin.surpren...@gmail.com
wrote:

Hi,

What are the options for creating a new index form an existing very
large data set? do we need to linearly walk the data and insert each
document one-by-one?

Otherwise, given a distributed datastore with mapreduce support, would
it be possible to leverage such a framework to distribute the ES index
creation by launching mapreduce functions to, for example, compute
some new information over our existing data and create a new index
from it??

Thanks for your help,
Colin


(Shay Banon) #5

Sounds perfect. Do you use Hadoop? If so, simply use the native Java APIs
elasticsearch comes with and call the index API as part of your jobs.

-shay.banon

On Fri, Mar 26, 2010 at 1:08 AM, Colin Surprenant <
colin.surprenant@gmail.com> wrote:

I mean leveraging a mapreduce framework over a distributed datastore
to speedup the index creation.

I have a very large dataset over which we run mapreduce tasks to
analyse and/or transform the data and for which we might want to
create new indexes for this new transformed/extracted data. The most
efficient solution would be to create the indexes as part of the
mapreduce tasks to distribute the process.

Colin

On Mar 25, 5:15 pm, the claw nick.minute...@gmail.com wrote:

Do you mean using Elastic Search to distribute the process of loading/
transforming the data?
Or just the distribution of the indexing (which ES does).

-Nick

On Mar 25, 7:37 pm, Colin Surprenant colin.surpren...@gmail.com
wrote:

Hi,

What are the options for creating a new index form an existing very
large data set? do we need to linearly walk the data and insert each
document one-by-one?

Otherwise, given a distributed datastore with mapreduce support, would
it be possible to leverage such a framework to distribute the ES index
creation by launching mapreduce functions to, for example, compute
some new information over our existing data and create a new index
from it??

Thanks for your help,
Colin


(Colin Surprenant) #6

Well, we do have a hadoop prototype but with lucene and katta. I am
currently loooking into riak.

Colin

On Mar 25, 6:36 pm, Shay Banon shay.ba...@elasticsearch.com wrote:

Sounds perfect. Do you use Hadoop? If so, simply use the native Java APIs
elasticsearch comes with and call the index API as part of your jobs.

-shay.banon

On Fri, Mar 26, 2010 at 1:08 AM, Colin Surprenant <

colin.surpren...@gmail.com> wrote:

I mean leveraging a mapreduce framework over a distributed datastore
to speedup the index creation.

I have a very large dataset over which we run mapreduce tasks to
analyse and/or transform the data and for which we might want to
create new indexes for this new transformed/extracted data. The most
efficient solution would be to create the indexes as part of the
mapreduce tasks to distribute the process.

Colin

On Mar 25, 5:15 pm, the claw nick.minute...@gmail.com wrote:

Do you mean using Elastic Search to distribute the process of loading/
transforming the data?
Or just the distribution of the indexing (which ES does).

-Nick

On Mar 25, 7:37 pm, Colin Surprenant colin.surpren...@gmail.com
wrote:

Hi,

What are the options for creating a new index form an existing very
large data set? do we need to linearly walk the data and insert each
document one-by-one?

Otherwise, given a distributed datastore with mapreduce support, would
it be possible to leverage such a framework to distribute the ES index
creation by launching mapreduce functions to, for example, compute
some new information over our existing data and create a new index
from it??

Thanks for your help,
Colin


(Shay Banon) #7

Not really sure I understood then. Where do you store your data that you
plan to run your map reduce jobs on? Hadoop for Lucene+Katta is not where
you store your data...

-shay.banon

On Fri, Mar 26, 2010 at 7:23 AM, Colin Surprenant <
colin.surprenant@gmail.com> wrote:

Well, we do have a hadoop prototype but with lucene and katta. I am
currently loooking into riak.

Colin

On Mar 25, 6:36 pm, Shay Banon shay.ba...@elasticsearch.com wrote:

Sounds perfect. Do you use Hadoop? If so, simply use the native Java APIs
elasticsearch comes with and call the index API as part of your jobs.

-shay.banon

On Fri, Mar 26, 2010 at 1:08 AM, Colin Surprenant <

colin.surpren...@gmail.com> wrote:

I mean leveraging a mapreduce framework over a distributed datastore
to speedup the index creation.

I have a very large dataset over which we run mapreduce tasks to
analyse and/or transform the data and for which we might want to
create new indexes for this new transformed/extracted data. The most
efficient solution would be to create the indexes as part of the
mapreduce tasks to distribute the process.

Colin

On Mar 25, 5:15 pm, the claw nick.minute...@gmail.com wrote:

Do you mean using Elastic Search to distribute the process of
loading/

transforming the data?
Or just the distribution of the indexing (which ES does).

-Nick

On Mar 25, 7:37 pm, Colin Surprenant colin.surpren...@gmail.com
wrote:

Hi,

What are the options for creating a new index form an existing very
large data set? do we need to linearly walk the data and insert
each

document one-by-one?

Otherwise, given a distributed datastore with mapreduce support,
would

it be possible to leverage such a framework to distribute the ES
index

creation by launching mapreduce functions to, for example, compute
some new information over our existing data and create a new index
from it??

Thanks for your help,
Colin


(Colin Surprenant) #8

In our hadoop+katta prototype, the data is simply in hdfs and we made
an index creating job based on the katta hadoop indexing job samples.

What I am currently looking at is to use riak http://riak.basho.com/
for data storage and their mapreduce framework https://wiki.basho.com/display/RIAK/MapReduce
to launch elasticsearch index creation jobs. I am trying evaluate what
will be the most efficient way to parallelize the index creation when
creating a new index over the complete data and what would be the best
integration point between riak mapreduce and elasticsearch.

Colin

On Mar 26, 6:14 am, Shay Banon shay.ba...@elasticsearch.com wrote:

Not really sure I understood then. Where do you store your data that you
plan to run your map reduce jobs on? Hadoop for Lucene+Katta is not where
you store your data...

-shay.banon

On Fri, Mar 26, 2010 at 7:23 AM, Colin Surprenant <

colin.surpren...@gmail.com> wrote:

Well, we do have a hadoop prototype but with lucene and katta. I am
currently loooking into riak.

Colin

On Mar 25, 6:36 pm, Shay Banon shay.ba...@elasticsearch.com wrote:

Sounds perfect. Do you use Hadoop? If so, simply use the native Java APIs
elasticsearch comes with and call the index API as part of your jobs.

-shay.banon

On Fri, Mar 26, 2010 at 1:08 AM, Colin Surprenant <

colin.surpren...@gmail.com> wrote:

I mean leveraging a mapreduce framework over a distributed datastore
to speedup the index creation.

I have a very large dataset over which we run mapreduce tasks to
analyse and/or transform the data and for which we might want to
create new indexes for this new transformed/extracted data. The most
efficient solution would be to create the indexes as part of the
mapreduce tasks to distribute the process.

Colin

On Mar 25, 5:15 pm, the claw nick.minute...@gmail.com wrote:

Do you mean using Elastic Search to distribute the process of
loading/

transforming the data?
Or just the distribution of the indexing (which ES does).

-Nick

On Mar 25, 7:37 pm, Colin Surprenant colin.surpren...@gmail.com
wrote:

Hi,

What are the options for creating a new index form an existing very
large data set? do we need to linearly walk the data and insert
each

document one-by-one?

Otherwise, given a distributed datastore with mapreduce support,
would

it be possible to leverage such a framework to distribute the ES
index

creation by launching mapreduce functions to, for example, compute
some new information over our existing data and create a new index
from it??

Thanks for your help,
Colin


(Shay Banon) #9

I see, thanks for sharing. If data is stored in Hadoop, I guess similar map
reduce jobs katta gave as samples can be used with elasticsearch. With riak,
I am not too familiar with it, but the same logic should apply.

-shay.banon

On Fri, Mar 26, 2010 at 5:07 PM, Colin Surprenant <
colin.surprenant@gmail.com> wrote:

In our hadoop+katta prototype, the data is simply in hdfs and we made
an index creating job based on the katta hadoop indexing job samples.

What I am currently looking at is to use riak http://riak.basho.com/
for data storage and their mapreduce framework
https://wiki.basho.com/display/RIAK/MapReduce
to launch elasticsearch index creation jobs. I am trying evaluate what
will be the most efficient way to parallelize the index creation when
creating a new index over the complete data and what would be the best
integration point between riak mapreduce and elasticsearch.

Colin

On Mar 26, 6:14 am, Shay Banon shay.ba...@elasticsearch.com wrote:

Not really sure I understood then. Where do you store your data that you
plan to run your map reduce jobs on? Hadoop for Lucene+Katta is not where
you store your data...

-shay.banon

On Fri, Mar 26, 2010 at 7:23 AM, Colin Surprenant <

colin.surpren...@gmail.com> wrote:

Well, we do have a hadoop prototype but with lucene and katta. I am
currently loooking into riak.

Colin

On Mar 25, 6:36 pm, Shay Banon shay.ba...@elasticsearch.com wrote:

Sounds perfect. Do you use Hadoop? If so, simply use the native Java
APIs

elasticsearch comes with and call the index API as part of your jobs.

-shay.banon

On Fri, Mar 26, 2010 at 1:08 AM, Colin Surprenant <

colin.surpren...@gmail.com> wrote:

I mean leveraging a mapreduce framework over a distributed
datastore

to speedup the index creation.

I have a very large dataset over which we run mapreduce tasks to
analyse and/or transform the data and for which we might want to
create new indexes for this new transformed/extracted data. The
most

efficient solution would be to create the indexes as part of the
mapreduce tasks to distribute the process.

Colin

On Mar 25, 5:15 pm, the claw nick.minute...@gmail.com wrote:

Do you mean using Elastic Search to distribute the process of
loading/

transforming the data?
Or just the distribution of the indexing (which ES does).

-Nick

On Mar 25, 7:37 pm, Colin Surprenant <colin.surpren...@gmail.com

wrote:

Hi,

What are the options for creating a new index form an existing
very

large data set? do we need to linearly walk the data and insert
each

document one-by-one?

Otherwise, given a distributed datastore with mapreduce
support,

would

it be possible to leverage such a framework to distribute the
ES

index

creation by launching mapreduce functions to, for example,
compute

some new information over our existing data and create a new
index

from it??

Thanks for your help,
Colin


(John Lynch-2) #10

Colin,

If it helps here is an example of using Ruby/Typhoeus to concurrently
submit data to Riak and ES.

http://rigelgroupllc.com/wp/blog/tee-with-sinatra

When you run a MR job in Riak, you get the results back via HTTP and
would then need to POST those results to ES. If you were really
concerned about performance, you could write your MR functions in
Erlang an do the POST to ES from there, I believe, although Im not
sure it would be significantly faster. The best way to speed up the
index creation is probably to just have more servers -- Riak and ES
both scale linearly. Im not sure about ES, but Riak will also scale
down, so you have the option of adding more servers for a specific
job, then when it is complete, scale your cluster back down.

Regards,

John Lynch, CTO
Rigel Group, LLC
john@rigelgroupllc.com

Also, in the comments to the post, someone suggests using http://www.pypes.org/
which is a high performance tool to

On Mar 26, 7:07 am, Colin Surprenant colin.surpren...@gmail.com
wrote:

In our hadoop+katta prototype, the data is simply in hdfs and we made
an index creating job based on the katta hadoop indexing job samples.

What I am currently looking at is to use riakhttp://riak.basho.com/
for data storage and their mapreduce frameworkhttps://wiki.basho.com/display/RIAK/MapReduce
to launch elasticsearch index creation jobs. I am trying evaluate what
will be the most efficient way to parallelize the index creation when
creating a new index over the complete data and what would be the best
integration point between riak mapreduce and elasticsearch.

Colin

On Mar 26, 6:14 am, Shay Banon shay.ba...@elasticsearch.com wrote:

Not really sure I understood then. Where do you store your data that you
plan to run your map reduce jobs on? Hadoop for Lucene+Katta is not where
you store your data...

-shay.banon

On Fri, Mar 26, 2010 at 7:23 AM, Colin Surprenant <

colin.surpren...@gmail.com> wrote:

Well, we do have a hadoop prototype but with lucene and katta. I am
currently loooking into riak.

Colin

On Mar 25, 6:36 pm, Shay Banon shay.ba...@elasticsearch.com wrote:

Sounds perfect. Do you use Hadoop? If so, simply use the native Java APIs
elasticsearch comes with and call the index API as part of your jobs.

-shay.banon

On Fri, Mar 26, 2010 at 1:08 AM, Colin Surprenant <

colin.surpren...@gmail.com> wrote:

I mean leveraging a mapreduce framework over a distributed datastore
to speedup the index creation.

I have a very large dataset over which we run mapreduce tasks to
analyse and/or transform the data and for which we might want to
create new indexes for this new transformed/extracted data. The most
efficient solution would be to create the indexes as part of the
mapreduce tasks to distribute the process.

Colin

On Mar 25, 5:15 pm, the claw nick.minute...@gmail.com wrote:

Do you mean using Elastic Search to distribute the process of
loading/

transforming the data?
Or just the distribution of the indexing (which ES does).

-Nick

On Mar 25, 7:37 pm, Colin Surprenant colin.surpren...@gmail.com
wrote:

Hi,

What are the options for creating a new index form an existing very
large data set? do we need to linearly walk the data and insert
each

document one-by-one?

Otherwise, given a distributed datastore with mapreduce support,
would

it be possible to leverage such a framework to distribute the ES
index

creation by launching mapreduce functions to, for example, compute
some new information over our existing data and create a new index
from it??

Thanks for your help,
Colin


(egaumer) #11

On Thu, Mar 25, 2010 at 6:08 PM, Colin Surprenant <
colin.surprenant@gmail.com> wrote:

I mean leveraging a mapreduce framework over a distributed datastore
to speedup the index creation.

I have a very large dataset over which we run mapreduce tasks to
analyse and/or transform the data and for which we might want to
create new indexes for this new transformed/extracted data. The most
efficient solution would be to create the indexes as part of the
mapreduce tasks to distribute the process.

Seems like you need more map and less reduce. The reduce phase (in this
case) seems like it would cause a bottleneck. Why reduce results coming from
riak partitions down to a single data stream? Since ES is distributed, the
goal is to create as many parallel processes (i.e., mappers) as possible. If
you execute a reduce phase then you're essentially reducing the result
stream to a single process (jn which case it needs to be re-parallelized).

Traditionally, you leverage some middleware component (think ETL like) to
help parallelize/partition the data stream. With these newer technologies,
if you can leverage the map phase then that sounds like a more intelligent
solution.

Even better, these new data stores should provide inherent search indexing
as a function the CRUD operations. Think about Apple's Spotlight. You don't
have to think about indexing new documents as you save them to your
filesystem. It all happens for you, behind the scenes. It's inherent to the
file system stack.

Basho has mentioned adding "hooks" to allow these sort of semantics in riak.
Sergio and Shay have provided similar functionality with Terrastore and
Elasticsearch. In my opinion, this is the ideal solution.

Regards,
-Eric


(Colin Surprenant) #12

ah! sorry for the late reply, thanks a lot for the info and pointer,
this is great stuff and exactly in line with what I am working on now,
see thread on the Riak mailing list
http://lists.basho.com/pipermail/riak-users_lists.basho.com/2010-April/000920.html

Thanks again,
Colin

On Tue, Mar 30, 2010 at 1:56 PM, john@rigelgroupllc.com
johnrlynch@gmail.com wrote:

Colin,

If it helps here is an example of using Ruby/Typhoeus to concurrently
submit data to Riak and ES.

http://rigelgroupllc.com/wp/blog/tee-with-sinatra

When you run a MR job in Riak, you get the results back via HTTP and
would then need to POST those results to ES. If you were really
concerned about performance, you could write your MR functions in
Erlang an do the POST to ES from there, I believe, although Im not
sure it would be significantly faster. The best way to speed up the
index creation is probably to just have more servers -- Riak and ES
both scale linearly. Im not sure about ES, but Riak will also scale
down, so you have the option of adding more servers for a specific
job, then when it is complete, scale your cluster back down.

Regards,

John Lynch, CTO
Rigel Group, LLC
john@rigelgroupllc.com

Also, in the comments to the post, someone suggests using http://www.pypes.org/
which is a high performance tool to

On Mar 26, 7:07 am, Colin Surprenant colin.surpren...@gmail.com
wrote:

In our hadoop+katta prototype, the data is simply in hdfs and we made
an index creating job based on the katta hadoop indexing job samples.

What I am currently looking at is to use riakhttp://riak.basho.com/
for data storage and their mapreduce frameworkhttps://wiki.basho.com/display/RIAK/MapReduce
to launch elasticsearch index creation jobs. I am trying evaluate what
will be the most efficient way to parallelize the index creation when
creating a new index over the complete data and what would be the best
integration point between riak mapreduce and elasticsearch.

Colin

On Mar 26, 6:14 am, Shay Banon shay.ba...@elasticsearch.com wrote:

Not really sure I understood then. Where do you store your data that you
plan to run your map reduce jobs on? Hadoop for Lucene+Katta is not where
you store your data...

-shay.banon

On Fri, Mar 26, 2010 at 7:23 AM, Colin Surprenant <

colin.surpren...@gmail.com> wrote:

Well, we do have a hadoop prototype but with lucene and katta. I am
currently loooking into riak.

Colin

On Mar 25, 6:36 pm, Shay Banon shay.ba...@elasticsearch.com wrote:

Sounds perfect. Do you use Hadoop? If so, simply use the native Java APIs
elasticsearch comes with and call the index API as part of your jobs.

-shay.banon

On Fri, Mar 26, 2010 at 1:08 AM, Colin Surprenant <

colin.surpren...@gmail.com> wrote:

I mean leveraging a mapreduce framework over a distributed datastore
to speedup the index creation.

I have a very large dataset over which we run mapreduce tasks to
analyse and/or transform the data and for which we might want to
create new indexes for this new transformed/extracted data. The most
efficient solution would be to create the indexes as part of the
mapreduce tasks to distribute the process.

Colin

On Mar 25, 5:15 pm, the claw nick.minute...@gmail.com wrote:

Do you mean using Elastic Search to distribute the process of
loading/

transforming the data?
Or just the distribution of the indexing (which ES does).

-Nick

On Mar 25, 7:37 pm, Colin Surprenant colin.surpren...@gmail.com
wrote:

Hi,

What are the options for creating a new index form an existing very
large data set? do we need to linearly walk the data and insert
each

document one-by-one?

Otherwise, given a distributed datastore with mapreduce support,
would

it be possible to leverage such a framework to distribute the ES
index

creation by launching mapreduce functions to, for example, compute
some new information over our existing data and create a new index
from it??

Thanks for your help,
Colin


(system) #13