[Hadoop] capability clarification questions


(Josh Harrison) #1

In looking around I haven't been able to find explicit answers to these
questions, though the questions may entirely be because I'm a Hadoop
newbie.
If we were to deploy ES within a Hadoop environment:
The primary benefit is allowing direct interaction with ES from Hadoop,
running queries or indexing data, is that right?
Are there explicit benefits to search speed and capability when run through
the normal REST or other client APIs? That is to say, if I have a set of N
documents and a query that takes T seconds to run on a normal cluster
through curl, would there be a marked improvement in T when running the
same query through curl against a Hadoop-enabled cluster?
Are the ideal architecture designs for a Hadoop-enabled ES cluster the
same as, or similar to, those for a "regular" cluster?
If they're the same, does a Hadoop-enabled cluster need to be designed as
such from the start, or can that functionality be tacked on to an already
functioning cluster with data? Our situation is that we're on a cluster of
machines running Hadoop, but the ES nodes are just running on the compute
nodes like a regular service. I'm wondering what it would take to enable the
Hadoop capabilities.

Thanks!

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/4dea6c95-75b8-4ed7-a054-3f9eaedde9d3%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Costin Leau) #2

Hi,

This section [1] of the Elasticsearch for Apache Hadoop reference tries to answer your questions. In short, as
opposed to a 'normal' client, es-hadoop parallelizes your reads and writes: if you have a Hadoop job with 5 tasks
running in parallel, you'll end up with 5 parallel writes to Es.
In a similar vein, when you read data from Es you'll get parallel reads: if your index has 5 shards, you'll end up
with 5 different tasks streaming/reading data from Es.
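
To make the "one task per shard" idea concrete, here is a minimal toy model in Python. It is only a sketch of the parallelism described above and does not use es-hadoop or Elasticsearch at all: the INDEX dictionary and the function names plan_read_tasks, read_shard and parallel_read are hypothetical stand-ins invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for an index: shard id -> documents held by that shard.
INDEX = {
    0: ["doc-a", "doc-b"],
    1: ["doc-c"],
    2: ["doc-d", "doc-e", "doc-f"],
    3: [],
    4: ["doc-g"],
}


def plan_read_tasks(index):
    """Plan one read task per shard (5 shards -> 5 tasks)."""
    return sorted(index.keys())


def read_shard(index, shard_id):
    """A single task streams only the documents of its own shard."""
    return list(index[shard_id])


def parallel_read(index):
    """Run all per-shard read tasks concurrently and merge their results."""
    tasks = plan_read_tasks(index)
    # max(1, ...) keeps the pool valid even for an index with no shards.
    with ThreadPoolExecutor(max_workers=max(1, len(tasks))) as pool:
        per_shard = list(pool.map(lambda shard: read_shard(index, shard), tasks))
    # pool.map preserves task order, so results come back in shard order.
    return [doc for docs in per_shard for doc in docs]
```

Calling parallel_read(INDEX) starts five concurrent readers, one per shard. The write side is analogous: each running job task would write its own partition of the data independently.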

Regarding compatibility, es-hadoop works against Apache Hadoop 1.x and 2.x and various other Hadoop distros. There's
nothing extra that you have to do to your cluster; again, this is covered in the reference docs here [2].

As for deployment, you can install Es on the same physical cluster as Hadoop or on a separate one; it's really up to you
and your hardware. As long as you have spare RAM and CPU, you can co-locate the two (which es-hadoop will take advantage
of). In fact, you don't need the same number of Es and Hadoop nodes; you can mix and match depending on your
requirements.
If I understand correctly, you are already running both on the same machines, which is fine.
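
As a rough illustration of what taking advantage of co-location could look like, the sketch below assigns each shard's read task to the Hadoop node that also hosts the shard when there is one, and otherwise falls back to any Hadoop node. This is an assumed, simplified model and not es-hadoop's actual scheduling logic; the function assign_splits and the node names in the usage note are hypothetical.

```python
def assign_splits(shard_hosts, hadoop_nodes):
    """Map each shard to (hadoop_node, is_local).

    shard_hosts:  dict of shard id -> host name of the node holding the shard
    hadoop_nodes: collection of host names that can run Hadoop tasks
    """
    nodes = sorted(hadoop_nodes)
    if not nodes:
        raise ValueError("at least one Hadoop node is required")

    assignment = {}
    fallback = 0  # round-robin cursor for shards without a co-located node
    for shard, host in sorted(shard_hosts.items()):
        if host in nodes:
            # Co-located: the task reads the shard from the local machine.
            assignment[shard] = (host, True)
        else:
            # Not co-located: any Hadoop node reads the shard over the network.
            assignment[shard] = (nodes[fallback % len(nodes)], False)
            fallback += 1
    return assignment
```

For example, assign_splits({0: "node1", 1: "node2", 2: "es-only"}, {"node1", "node2", "node3"}) keeps shards 0 and 1 local and sends shard 2 to a remote reader, which also shows that the two clusters need not have the same number of nodes.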

I suggest taking a look at the docs and trying out the examples in them (which you can find in the readme as well).
Things are easy to install and there's no extra provisioning that needs to be done.

Cheers,

[1] http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/current/arch.html
[2] http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/current/features.html

On 30/01/2014 9:55 PM, Josh Harrison wrote:


--
Costin


