How to generate an ES index in Hadoop

Hi all,

I am new to ES, and we have a large set of data that needs to be indexed into the ES
cluster daily (there is no delta available, and we only have 7~8 nodes).
I know that using a mapper function to call the client API directly should be fine,
but our Hadoop cluster policy does not allow that.
So I am wondering if there is a way to just generate the ES index inside Hadoop,
then copy the files into the cluster and have ES pick them up on reload.
Or could anyone point me to the right place in the source code related to this?

Any suggestion would be very helpful!

Many thanks
Jack


Have you looked at Elasticsearch-Hadoop [1]? You can use it to stream data between ES and Hadoop, in both directions.

[1] https://github.com/elasticsearch/elasticsearch-hadoop/
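
For a flavor of what that looks like, here is a minimal map-only job writing through es-hadoop's EsOutputFormat. This is a sketch only: the node address, index name, and input path are placeholders, and the class/property names follow es-hadoop 2.x.

    // Sketch of a map-only es-hadoop job; node address, index name,
    // and input path are illustrative (es-hadoop 2.x naming).
    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.MapWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.elasticsearch.hadoop.mr.EsOutputFormat;

    public class EsStreamingJob {

        // Turns each input line into a one-field ES document.
        public static class LineMapper
                extends Mapper<LongWritable, Text, NullWritable, MapWritable> {
            @Override
            protected void map(LongWritable key, Text line, Context ctx)
                    throws IOException, InterruptedException {
                MapWritable doc = new MapWritable();
                doc.put(new Text("message"), new Text(line.toString()));
                ctx.write(NullWritable.get(), doc);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("es.nodes", "es-node-1:9200"); // any node of the target cluster
            conf.set("es.resource", "events/log");  // index/type to write into
            // es-hadoop recommends disabling speculative execution for writes
            conf.setBoolean("mapred.map.tasks.speculative.execution", false);

            Job job = new Job(conf, "stream-to-es");
            job.setJarByClass(EsStreamingJob.class);
            job.setMapperClass(LineMapper.class);
            job.setNumReduceTasks(0); // map-only: mappers write straight to ES
            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(EsOutputFormat.class);
            job.setOutputKeyClass(NullWritable.class);
            job.setOutputValueClass(MapWritable.class);
            FileInputFormat.addInputPath(job, new Path("/data/events"));
            job.waitForCompletion(true);
        }
    }

Each task opens its own bulk connection, so the write load spreads across the shards of the target index.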


--
Costin


Thanks Costin,

I am afraid that I am not allowed to use it (or any API) because of the
cluster policy. What I am looking for is to complete the indexing part
entirely offline, in Hadoop. Is that feasible?


I'm not sure what you mean by "offline in Hadoop"...
Indexing the data requires ES. You could try to replicate it manually, but I would argue you'll end up duplicating the
work done in ES.
You could potentially set up a smaller (even one-node) ES cluster just for indexing, in parallel with or co-located
with your Hadoop cluster; use it to do the indexing and then copy the indexes over to the live cluster.
That is, you'd have two ES clusters: one for staging/indexing and another for live/read-only data...
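
In code, that copy step could look like the sketch below, using the snapshot/restore API that later shipped in ES 1.0. Hostnames, repository, and index names are illustrative, and the repository location must be a filesystem visible to both clusters.

    // Sketch: snapshot on the staging cluster, restore on the live one
    // (ES 1.x Java API); hostnames and names are illustrative.
    import org.elasticsearch.client.Client;
    import org.elasticsearch.client.transport.TransportClient;
    import org.elasticsearch.common.settings.ImmutableSettings;
    import org.elasticsearch.common.transport.InetSocketTransportAddress;

    public class StagingToLive {
        public static void main(String[] args) {
            Client staging = new TransportClient()
                .addTransportAddress(new InetSocketTransportAddress("staging-es", 9300));
            Client live = new TransportClient()
                .addTransportAddress(new InetSocketTransportAddress("live-es", 9300));

            // Both clusters register the same shared filesystem location.
            for (Client c : new Client[] { staging, live }) {
                c.admin().cluster().preparePutRepository("nightly")
                    .setType("fs")
                    .setSettings(ImmutableSettings.settingsBuilder()
                        .put("location", "/mnt/es-backups").build())
                    .get();
            }

            // Index on staging, then snapshot the finished index...
            staging.admin().cluster().prepareCreateSnapshot("nightly", "batch-1")
                .setIndices("events").setWaitForCompletion(true).get();

            // ...and restore that snapshot into the live cluster.
            live.admin().cluster().prepareRestoreSnapshot("nightly", "batch-1")
                .setWaitForCompletion(true).get();

            staging.close();
            live.close();
        }
    }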


--
Costin


Hi Costin,

We're very interested in offline processing as well. To draw a parallel to
HBase: you could write a Hadoop job that writes out to a table over the
Thrift API, but if you're going to load many terabytes of data, there's
the option of writing directly to the HFile format and bulk loading the
files into your cluster once generated. Bulk loading was orders of
magnitude faster than the HTTP-based API.
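
For reference, the HFile bulk-load pattern being described looks roughly like the sketch below, against the classic HBase MapReduce API. The table name and paths are made up, and a mapper emitting Put/KeyValue objects is assumed but omitted.

    // Sketch of HBase bulk loading (classic HBase 0.9x API);
    // table name and paths are illustrative.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
    import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
    import org.apache.hadoop.mapreduce.Job;

    public class BulkLoadSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            Job job = new Job(conf, "hfile-generation");
            HTable table = new HTable(conf, "events");

            // Instead of writing through the Thrift/HTTP API, the job emits
            // HFiles directly; this call wires in the partitioner and sort
            // order matching the table's regions. (A mapper emitting
            // Put/KeyValue objects per row is assumed here.)
            HFileOutputFormat.configureIncrementalLoad(job, table);
            HFileOutputFormat.setOutputPath(job, new Path("/tmp/hfiles"));
            job.waitForCompletion(true);

            // Once generated, the HFiles are handed to the region servers
            // in bulk, bypassing the write path entirely.
            new LoadIncrementalHFiles(conf).doBulkLoad(new Path("/tmp/hfiles"), table);
        }
    }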

Writing out Lucene segments from a Hadoop job is nothing new (check out the
Katta project, http://katta.sourceforge.net/). I saw that ES has snapshot
backup/restore in the pipeline. It'd be fantastic if we could write Hadoop
jobs that output data in the same format as ES backups and then use the
restore functionality in ES to bulk load the data directly, without going
through a REST API. I feel like that would be faster, and it would provide
the flexibility to scale the Hadoop cluster independently of the ES cluster.



If you are concerned about indexing data directly into a live cluster, you could simply have a separate staging
cluster (with a full topology as well) to index into. Then take a snapshot (potentially to HDFS) and load the data
into your live cluster.
This is already supported - see this blog post [1].
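
For the HDFS case specifically, the elasticsearch-repository-hdfs plugin (shipped alongside es-hadoop) lets snapshots land directly on HDFS. A sketch of registering such a repository, with a made-up namenode URI and path:

    // Sketch: registering an HDFS snapshot repository. Requires the
    // elasticsearch-repository-hdfs plugin on every node; URI and path
    // below are illustrative.
    import org.elasticsearch.client.Client;
    import org.elasticsearch.common.settings.ImmutableSettings;

    public class HdfsRepoSetup {
        static void registerHdfsRepository(Client client) {
            client.admin().cluster().preparePutRepository("hdfs_snapshots")
                .setType("hdfs")
                .setSettings(ImmutableSettings.settingsBuilder()
                    .put("uri", "hdfs://namenode:8020") // namenode URI (illustrative)
                    .put("path", "backups/es")          // HDFS directory for snapshot data
                    .build())
                .get();
        }
    }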

Note that ES uses Lucene internally, but the Lucene segments are only part of an index; ES keeps internal metadata of its own on top of them.

Recreating an ES index directly inside a job means hacking and reimplementing a lot of the distributed work that ES
already does for you, without much (if any) performance gain:

  • Each job/task would have its own ES instance. This means a 1:1 mapping between the job tasks and ES nodes, which is
    a waste of resources.
  • Each ES instance would depend on the machine running the task/job. This can overload the hardware, since you are
    forced to co-locate the two whether you want to or not.
  • At the end, each ES instance would have to export its data somehow. Since each node only gets some chunk of the
    data, the indices would have to be aggregated.
    This implies significant I/O (you move the same data multiple times) and at least twice the amount of disk space.
    The network I/O is amplified further when indexing on HDFS, since the disk is not local; with ES you can choose
    whether to use HDFS or not (for best performance I would advise against it).

Also consider the allocation of data across ES shards/nodes (driven by the cluster topology plus user settings): for
the most part, folding the externally built indices back in would amount to another reindexing.

The current approach of es-hadoop has none of these issues and all of the benefits. You can scale Hadoop and ES
independently of each other: your job can have tens (or in some cases hundreds) of tasks streaming data to an ES
cluster of 5-10 beefy nodes. You can start with ES co-located on the same physical machines as Hadoop and, as you
grow, move some or all of the nodes to a different setup.
Since es-hadoop parallelizes both reads and writes, the Hadoop job gets full access to the ES cluster; the bigger the
target index, the more shards it can talk to in parallel.

Additionally, there is minimal I/O: the data is moved only once, straight to ES.

If you have a performance problem caused by es-hadoop, I'd be happy to look at the numbers/stats.

Hope this helps,

[1] Elastic blog post on snapshot & restore (URL not preserved)


--
Costin


Hey Costin,

Spinning up a dedicated cluster for bulk loading seems like the right
thing for us, but that's a big moving piece.

I think there might be a middle ground. I did a Hadoop experiment where I
spun up an embedded ES client inside a reducer, with the intention of having
each reducer own a single index. We partition data by day, so 365 indices =
365 reducers = 365 single-node ES clusters. I'd probably have each client
work independently, with no replication. Then all I have to do is have the
client send the generated index to a snapshot repo (perhaps S3) in the
close method of the reducer. Here's a totally hacky snippet to give you an
idea: "ES Embedded Hadoop Writer" on Pastebin. After that it's just a snapshot
restore request on the live cluster to load it in from the snapshot repo.
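
A compressed sketch of that pattern against the ES 1.x node API follows. The cluster, directory, and repository names are made up, and the snapshot call in finish() assumes a shared repository (such as the ones sketched earlier) is already registered.

    // Sketch of a reducer-embedded single-node ES instance (ES 1.x API);
    // cluster, directory, and repository names are illustrative.
    import org.elasticsearch.client.Client;
    import org.elasticsearch.common.settings.ImmutableSettings;
    import org.elasticsearch.node.Node;
    import org.elasticsearch.node.NodeBuilder;

    public class EmbeddedIndexer {
        private Node node;
        private Client client;

        // Called from Reducer.setup(): one private single-node "cluster" per reducer.
        public void start(String taskId, String localDataDir) {
            node = NodeBuilder.nodeBuilder()
                .local(true)                      // in-JVM transport, no discovery traffic
                .clusterName("embedded-" + taskId)
                .settings(ImmutableSettings.settingsBuilder()
                    .put("path.data", localDataDir)     // task tracker's local disk
                    .put("index.number_of_replicas", 0) // no replication, per the plan above
                    .put("http.enabled", false))
                .node();
            client = node.client();
        }

        // Called from Reducer.reduce() for each document.
        public void index(String index, String type, String json) {
            client.prepareIndex(index, type).setSource(json).get();
        }

        // Called from Reducer.cleanup(): snapshot to the shared repo, then shut down.
        public void finish(String repo, String snapshot, String index) {
            client.admin().cluster().prepareCreateSnapshot(repo, snapshot)
                .setIndices(index).setWaitForCompletion(true).get();
            node.close();
        }
    }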

There's plenty I could do to improve that, and it'll certainly require beefy
task trackers, but that's livable. The part that makes me unhappy is that
it uses the local storage of the task trackers rather than HDFS for writing
the data. Performance will probably be better on local storage, but this
approach limits the size of an index to:

(size of the disk on a task tracker) / (number of reducer slots per node) / (magic number)

For example, a 2 TB task-tracker disk with 8 reducer slots and a safety
factor of 2 would cap each index at roughly 128 GB.

Thus my question: you sort of alluded that one could use HDFS for writing
Elasticsearch data. I believe Lucene can write directly to HDFS, since Solr
can use HDFS as its backing store. However, I can't find any literature on
how to back an embedded ES instance with HDFS. Is that possible? Where
should I look?

Thanks!
Drew
