[HADOOP] Anyone used TransportClient for writing to ES from Hadoop mappers?

Hi all,

Currently I am working with the elasticsearch-hadoop library, using EsOutputFormat to
write to Elasticsearch,
but the writing looks slow to me (elasticsearch-hadoop sends HTTP bulk requests on
port 9200).
So my question: is it worth writing something of my own that uses Elasticsearch's
TransportClient to send bulks over TCP on port 9300?
(I think it would avoid opening an HTTP socket for each bulk; I would also no longer
write Writable objects to the Hadoop Context, but go directly to Elasticsearch
bulks... see the sketch below.)
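
Roughly what I have in mind (a minimal sketch against the 1.x TransportClient API; the
cluster name, host, index, and document below are just placeholders):

    import org.elasticsearch.action.bulk.BulkRequestBuilder;
    import org.elasticsearch.action.bulk.BulkResponse;
    import org.elasticsearch.client.transport.TransportClient;
    import org.elasticsearch.common.settings.ImmutableSettings;
    import org.elasticsearch.common.transport.InetSocketTransportAddress;

    public class BulkWriter {
        public static void main(String[] args) {
            // one long-lived client per JVM, talking TCP on port 9300
            TransportClient client = new TransportClient(
                    ImmutableSettings.settingsBuilder()
                            .put("cluster.name", "my-cluster").build())
                    .addTransportAddress(new InetSocketTransportAddress("es-host", 9300));

            // group documents into one bulk request instead of indexing one by one
            BulkRequestBuilder bulk = client.prepareBulk();
            bulk.add(client.prepareIndex("my-index", "my-type")
                    .setSource("{\"field\":\"value\"}"));
            BulkResponse response = bulk.execute().actionGet();
            if (response.hasFailures()) {
                System.err.println(response.buildFailureMessage());
            }
            client.close();
        }
    }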

Has anyone had experience with that and can share it?

Thanks,
Igor


Hi,

Some of the reasons behind using REST/HTTP are:

  • no extra dependencies required (the transport client adds 8MB+)
  • fairly good performance. This is a hot topic; however, due to Map/Reduce's parallel nature, it's very likely that one
    will overload ES before having to switch to the transport client to push more data through the network.
  • ease of deployment (firewalls, proxies, etc...)

If you encounter performance problems, let me know what they are and I'll try to help. There are various things that
can be done (now and in the future) to further increase the throughput.

A couple of remarks:

  • es-hadoop does not open a write HTTP socket on each bulk, but rather one for an entire write task (which implies
    multiple bulk requests). If that's not the case, it's a bug.
  • not sure what you mean by 'Hadoop Context writable objects' - can you provide some context?

When one reads data in Hadoop, whether through Map/Reduce or any of the libraries on top of it, the raw data (CSV,
TSV, gzip, snappy, etc...) is converted into Writable objects (or Tuples in Cascading and Pig, rows in Hive). This
is how Hadoop works; it is not a requirement of es-hadoop. A typical mapper for that case is sketched below.
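
For example, a minimal mapper emitting MapWritable documents for EsOutputFormat (the CSV layout and field names are
made up for illustration):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.MapWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // parse each CSV line into a MapWritable document; es-hadoop
    // serializes the map into a JSON document when writing to ES
    public class CsvMapper extends Mapper<LongWritable, Text, NullWritable, MapWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split(",");
            MapWritable doc = new MapWritable();
            doc.put(new Text("name"), new Text(fields[0]));
            doc.put(new Text("value"), new Text(fields[1]));
            context.write(NullWritable.get(), doc);
        }
    }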

Some folks break down the data into:
a. fine-grained objects - e.g. the line is read as a map/multiple tokens/array
b. raw content - the line is read as one big byte-array/Text and used as-is

Both cases are supported by es-hadoop. a) is the classic Map/Reduce case, while b) is used when looking for performance;
typically one would either have the data directly in JSON and 'stream' it through, line by line, or convert the line of
text into a JSON document and then push it to ES.
If the source is JSON [1], then es-hadoop 'streams' the data directly to ES without any processing (see the driver
sketch below).
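
A sketch of case b), assuming the input files already contain one JSON document per line ('es.input.json' is the
actual es-hadoop setting; the node address, index/type, and input path are placeholders):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.elasticsearch.hadoop.mr.EsOutputFormat;

    public class JsonStreamJob {
        // pass each JSON line through untouched; es-hadoop sends it as-is
        public static class PassThrough extends Mapper<LongWritable, Text, NullWritable, Text> {
            @Override
            protected void map(LongWritable offset, Text line, Context ctx)
                    throws IOException, InterruptedException {
                ctx.write(NullWritable.get(), line);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("es.nodes", "es-host:9200");      // placeholder
            conf.set("es.resource", "logs/entry");     // placeholder index/type
            conf.set("es.input.json", "yes");          // input is already JSON
            // avoid duplicate writes from speculatively executed tasks
            conf.setBoolean("mapred.map.tasks.speculative.execution", false);

            Job job = Job.getInstance(conf, "json-to-es");
            job.setJarByClass(JsonStreamJob.class);
            job.setMapperClass(PassThrough.class);
            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(EsOutputFormat.class);
            job.setOutputKeyClass(NullWritable.class);
            job.setOutputValueClass(Text.class);
            job.setNumReduceTasks(0);                  // map-only job
            FileInputFormat.addInputPath(job, new Path("/data/json"));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }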

Again, the Writable usage is mandated by Hadoop, mainly to allow Mappers/Reducers to communicate with each other. There
are a couple of ideas on how to improve this (including object pooling - see the idiom sketched below) in future
versions, but in the benchmarks we ran, this was far from the 'hot' list.
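
For reference, a generic (not es-hadoop-specific) sketch of the reuse idiom that pooling alludes to - allocate the
Writables once per task and overwrite them per record:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.MapWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class ReusingMapper extends Mapper<LongWritable, Text, NullWritable, MapWritable> {
        private static final Text NAME_KEY = new Text("name");
        private final MapWritable doc = new MapWritable();  // one instance per task
        private final Text name = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            doc.clear();                                    // overwrite, don't allocate
            name.set(line.toString().split(",")[0]);
            doc.put(NAME_KEY, name);
            ctx.write(NullWritable.get(), doc);
        }
    }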

When doing measurements, make sure to separate the Hadoop aspect from es-hadoop. A good way of doing that is ingesting
data in a Hadoop vs. a non-Hadoop environment; for example, try using Cascading on the local platform vs. on Hadoop with
your desired data set. The local mode runs really fast but can only scale to the current machine; the Hadoop one is
considerably slower on one machine, however it scales to as many nodes as your Hadoop cluster has.
You can find such a test in our suite - see the Cascading module.
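
A rough sketch of the local half of such a comparison (Cascading 2.x local-mode classes; the paths and the identity
pipe assembly are placeholders); swapping in HadoopFlowConnector with Hfs taps gives the Hadoop half:

    import cascading.flow.Flow;
    import cascading.flow.local.LocalFlowConnector;
    import cascading.pipe.Pipe;
    import cascading.scheme.local.TextLine;
    import cascading.tap.Tap;
    import cascading.tap.local.FileTap;

    public class LocalIngestTiming {
        public static void main(String[] args) {
            Tap in = new FileTap(new TextLine(), "data/input.txt");
            Tap out = new FileTap(new TextLine(), "data/output");
            Pipe pipe = new Pipe("ingest");  // identity assembly, just moves data

            long start = System.currentTimeMillis();
            Flow flow = new LocalFlowConnector().connect(in, out, pipe);
            flow.complete();
            System.out.println("local run took "
                    + (System.currentTimeMillis() - start) + " ms");
        }
    }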

Hope this helps,

[1] es-hadoop reference documentation, section on writing existing JSON (elastic.co)


--
Costin


Thank you for your answer.
I did some tests, writing something simple that sends bulks to the ES server using
TransportClient in parallel mode (something like BulkProcessor... see the sketch
below), and the Hadoop job ran in pretty much the same time, so there is no big
difference compared with using es-hadoop.
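
Roughly what I mean by that (BulkProcessor is the actual 1.x client class; the batch
size and concurrency here are arbitrary):

    import org.elasticsearch.action.bulk.BulkProcessor;
    import org.elasticsearch.action.bulk.BulkRequest;
    import org.elasticsearch.action.bulk.BulkResponse;
    import org.elasticsearch.action.index.IndexRequest;
    import org.elasticsearch.client.Client;

    public class BulkHelper {
        // batch index requests and flush them concurrently over TCP/9300
        public static BulkProcessor buildProcessor(Client client) {
            return BulkProcessor.builder(client, new BulkProcessor.Listener() {
                public void beforeBulk(long id, BulkRequest request) { }
                public void afterBulk(long id, BulkRequest request, BulkResponse response) {
                    if (response.hasFailures()) {
                        System.err.println(response.buildFailureMessage());
                    }
                }
                public void afterBulk(long id, BulkRequest request, Throwable failure) {
                    failure.printStackTrace();
                }
            })
            .setBulkActions(5000)       // flush every 5000 docs
            .setConcurrentRequests(2)   // keep two bulks in flight
            .build();
        }
    }

    // usage: processor.add(new IndexRequest("my-index", "my-type").source(json));
    //        processor.close();  // flush whatever is left at the end of the task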

But it looks to me like the ES server consumes less CPU when it receives bulks
via TransportClient.

Anyway, it looks like my bottleneck is somewhere inside the Hadoop mapper job (maybe
the parser, or reading lines from the file on S3... :confused: ), so I will research it...

Thanks!
Igor.
