Building with Hadoop - best setup?


(timrobertson100) #1

Hey,

Is anyone building their indexes using Hadoop? If so, are they deploying ES
across the same cluster as Hadoop and trying to reduce network noise by
making use of data locality, or keeping the clusters separate and just
calling over HTTP from MapReduce when building the indexes? I am about to
set up on EC2, and planned to keep the search and processing machines
separate.

Cheers,
Tim


(Shay Banon) #2

Hi,

You won't get any locality between Elasticsearch and Hadoop in any case, since
the two use different distribution models. Locality would only make sense
for the indexing part, and I think you probably won't really need it (it
should be fast enough).

What language are you going to write your jobs in? If Java, then use
the native Java client (obtained from a started "non data" Server) and
not HTTP. More here:
http://www.elasticsearch.com/docs/elasticsearch/java_api/client/#Server_Client

-shay.banon

On Fri, Mar 26, 2010 at 5:35 PM, Tim Robertson <timrobertson100@gmail.com> wrote:

Hey,

Is anyone building their indexes using Hadoop? If so, are they deploying
ES across the same cluster as Hadoop and trying to reduce network noise by
making use of data locality, or keeping the clusters separate and just
calling over HTTP from MapReduce when building the indexes? I am about to
set up on EC2, and planned to keep the search and processing machines
separate.

Cheers,
Tim


(timrobertson100) #3

Thanks Shay,

So... reading between the lines, does it then use protobufs (or something
similar) for RPC, instead of serializing and deserializing JSON?

Cheers
Tim

On Fri, Mar 26, 2010 at 3:58 PM, Shay Banon <shay.banon@elasticsearch.com> wrote:

Hi,

You won't get any locality between Elasticsearch and Hadoop in any case,
since the two use different distribution models. Locality would only make
sense for the indexing part, and I think you probably won't really need
it (it should be fast enough).

What language are you going to write your jobs in? If Java, then use
the native Java client (obtained from a started "non data" Server) and
not HTTP. More here:
http://www.elasticsearch.com/docs/elasticsearch/java_api/client/#Server_Client

-shay.banon


(Shay Banon) #4

With the Java client, the "source" (which is the indexable document) is
still JSON, but everything around it uses a highly optimized stream
serialization/deserialization. It is internal and does not use protobuf,
but I expect it to be at least 10x faster than protobuf.

-shay.banon
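
To make the idea above concrete, here is a minimal sketch of a request whose metadata travels as a compact binary stream while the document itself stays an opaque JSON string. This is not Elasticsearch's actual wire format, which the thread only describes as internal. The class name `EnvelopeSketch`, the field layout, and the sample index/type/id values are all assumptions made for illustration, using only the standard `java.io` data streams.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only: NOT Elasticsearch's real wire format.
final class EnvelopeSketch {

    /** Decoded request: routing metadata plus the untouched JSON source. */
    static final class Envelope {
        final String index;
        final String type;
        final String id;
        final String source;

        Envelope(String index, String type, String id, String source) {
            this.index = index;
            this.type = type;
            this.id = id;
            this.source = source;
        }
    }

    /** Writes the metadata as length-prefixed strings and the JSON as raw bytes. */
    static byte[] encode(String index, String type, String id, String jsonSource) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeUTF(index);
            out.writeUTF(type);
            out.writeUTF(id);
            // The JSON document is never parsed here; it is copied as-is.
            byte[] source = jsonSource.getBytes(StandardCharsets.UTF_8);
            out.writeInt(source.length);
            out.write(source);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the fields back in the same order they were written. */
    static Envelope decode(byte[] wire) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(wire));
            String index = in.readUTF();
            String type = in.readUTF();
            String id = in.readUTF();
            byte[] source = new byte[in.readInt()];
            in.readFully(source);
            return new Envelope(index, type, id, new String(source, StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] wire = encode("twitter", "tweet", "1", "{\"user\":\"tim\"}");
        Envelope e = decode(wire);
        System.out.println(e.index + "/" + e.type + "/" + e.id + " -> " + e.source);
    }
}
```

The point of the layout is that only the small metadata fields are encoded and decoded field by field; the JSON payload is carried through as a byte array, so no JSON parsing happens at the transport level.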

On Fri, Mar 26, 2010 at 8:10 PM, Tim Robertson <timrobertson100@gmail.com> wrote:

Thanks Shay,

So... reading between the lines, does it then use protobufs (or something
similar) for RPC, instead of serializing and deserializing JSON?

Cheers
Tim


(Shay Banon) #5

Just one note here, though: what takes most of the time in these cases is the
remote call itself, not the serialization. But when it comes to pure
serialization, it's highly optimized.

-shay.banon

On Fri, Mar 26, 2010 at 8:59 PM, Shay Banon <shay.banon@elasticsearch.com> wrote:

With the Java client, the "source" (which is the indexable document) is
still JSON, but everything around it uses a highly optimized stream
serialization/deserialization. It is internal and does not use protobuf,
but I expect it to be at least 10x faster than protobuf.

-shay.banon

