On 2/26/2014 11:26 PM, drew dahlke wrote:
We're very interested offline processing as well. To draw a parallel to
HBase, you could write a hadoop job that writes
out to a table over the thrift API. However if you're going to load in
many terrabytes of data, there's the option to
write out directly to the HTable file format and bulk load the file into
your cluster once generated. Bulk loading was
orders of magnitude faster than the HTTP based API.
Writing out lucene segments from a hadoop job is nothing new (check out
the katta project
http://katta.sourceforge.net/). I saw that ES has snapshot
backup/restore in the pipeline. It'd be fantastic if we could
write hadoop jobs that output data in the same format as ES backups and
then use the restore functionality in ES to bulk
load the data directly without having to go through a REST API. I feel
like that would be faster and it would provide
the flexibility to scale out the hadoop cluster independently of the ES
If you are concerned about indexing data directly into a live cluster, you
could just have a different, staging one
setup (with a complete topology as well) to which you can index data. Then
do a snapshot (potentially to HDFS) and then
load the data into your live one.
This is already supported - see this blog post .
Note that ES uses Lucene internally but the segments are just some part of
its internal metadata.
Recreating an ES index directly into a job means hacking and
reimplementing a lot of the distributed work that ES is
already doing for you without a lot (if any) performance gains:
- each job/task would have its own ES instance. This means a 1:1 mapping
between the job tasks and ES nodes which is a
waste of resources.
- each ES instance would rely on the running task/job machine. This can
overload the hardware since you are forced to
co-locate the two whether you want it or not.
- at the end each ES instance would have to export its data somehow. Since
each node only gets some chunk of the data,
the indices would have to be aggregated.
This implies significant I/O (since you are moving the same data multiple
times) and at least twice the amount of disk
space. The network I/O gets significantly amplified when using HDFS for
indexing since the disk is not local; with ES
you can chose to use HDFS or not (for best performance I would advise
Consider here the allocation of data within ES shards/nodes (based on the
cluster topology + user settings). For the
most part, this will be similar to another reindexing.
The current approach of es-hadoop has none of these issues and all the
benefits. You can scale Hadoop or ES independent
of each other - your job can have 10s (or 100s in some cases) of tasks
that are streaming data to an ES cluster of 5-10
beefy nodes. You can start with ES co-located on the same physical
machines as Hadoop and, as you grow move some or all
the nodes to a different setup.
Since es-hadoop parallelizes both reads and writes, the hadoop job gets
full access to the ES cluster; the bigger the
target index is, the more shards it can talk to in parallel.
Additionally, there's minimal I/O - we only move the data needed once to
If you have a performance problem caused by es-hadoop, I'd be happy to
look at the numbers/stats.
Hope this helps,
On Saturday, June 22, 2013 10:18:57 AM UTC-4, Costin Leau wrote:
I'm not sure what you mean by "offline in Hadoop"...
Indexing the data requires ES or you could try and replicate it
manually but I would argue you'll end up duplicating
work done in ES.
You could potentially setup a smaller (even one node) ES cluster
just for indexing in parallel or collocated with your
Hadoop cluster - you could use this to do the indexing and then copy
the indexes over to the live cluster.
That is, you'll have two ES clusters: one for staging/indexing and
another one for live/read-only data...
On 21/06/2013 9:27 PM, Jack Liu wrote:
> Thanks Costin,
> I am afraid that I am not allowed to use it ( or any API), because
of the cluster policy. What I am looking for is to
> complete the indexing part entirely offline
> in the hadoop, is it feasible though?
> On Friday, June 21, 2013 10:47:25 AM UTC-7, Costin Leau wrote:
> Have you looked at Elasticsearch-Hadoop  ? You can use it
to stream data to/from ES to/from Hadoop.
> https://github.com/elasticsearch/elasticsearch-hadoop/ <
> On 21/06/2013 8:38 PM, Jack Liu wrote:
> > Hi all,
> > I am new to ES, and we have large set of data need to be
indexed into ES cluster daily (there is no delta available, we
> > only have 7~8 nodes).
> > I know use mapper function to directly call client api
should be fine, however, our hadoop cluster policy does not allow
> > that.
> > So I am wondering if there is a way to just generate ES
index in the hadoop, and then copy them into the cluster and ES
> > could pick them up when reloading.
> > Or could anyone point me to right place in the source code
that is related to it.
> > Any suggestion could be very helpful !
> > Many thanks
> > Jack
> > --
> > You received this message because you are subscribed to the
Google Groups "elasticsearch" group.
> > To unsubscribe from this group and stop receiving emails
from it, send an email to
> > For more options, visithttps://
> You received this message because you are subscribed to the Google
Groups "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it,
send an email to
> For more options, visithttps://groups.google.com/groups/opt_out <
You received this message because you are subscribed to the Google
Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send
an email to
To view this discussion on the web visit
For more options, visit https://groups.google.com/groups/opt_out.