Hi,
Some of the reasons behind using REST/HTTP are:
- no extra dependencies required (the transport client adds 8MB+)
- fairly good performance. This is a hot topic; however, due to the parallel nature of Map/Reduce, it's very likely that one will
overload ES before having to switch to the transport client to push more data through the network.
- ease of deployment (firewalls, proxies, etc...)
If you encounter performance problems, let me know what these are and I'll try to help. There are various things that
can be done (now and in the future) to further increase the throughput.
A couple of remarks:
- es-hadoop does not open a new HTTP socket for each bulk request but rather one for an entire write task (which implies multiple
bulk requests). If that's not the case, it's a bug.
- not sure what you mean by "Hadoop Context writable objects" - can you provide some context?
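To illustrate the first remark, here is a minimal sketch in plain Python of sending several bulk requests over one reused HTTP connection. This is not es-hadoop's actual code; the host, port, index name and batch size are assumptions for illustration, and the request path and action line follow the bulk API of recent Elasticsearch versions (older versions also expect a type).

```python
import http.client
import json


def bulk_bodies(docs, batch_size):
    """Yield newline-delimited bulk bodies holding at most batch_size documents each."""
    action = json.dumps({"index": {}})  # target index is taken from the request path
    lines = []
    count = 0
    for doc in docs:
        lines.append(action)
        lines.append(json.dumps(doc))
        count += 1
        if count == batch_size:
            yield "\n".join(lines) + "\n"
            lines, count = [], 0
    if lines:
        # flush the last, possibly smaller, batch
        yield "\n".join(lines) + "\n"


def write_task(docs, host="localhost", port=9200, index="myindex", batch_size=1000):
    """Send all bulk requests of one write task over a single connection.

    host, port and index are hypothetical values, not es-hadoop defaults.
    """
    conn = http.client.HTTPConnection(host, port)  # opened once per task
    try:
        for body in bulk_bodies(docs, batch_size):
            conn.request(
                "POST",
                "/%s/_bulk" % index,
                body=body.encode("utf-8"),
                headers={"Content-Type": "application/x-ndjson"},
            )
            response = conn.getresponse()
            response.read()  # drain the response so the socket can be reused
            if response.status >= 300:
                raise RuntimeError("bulk request failed: %d" % response.status)
    finally:
        conn.close()
```

The point of the sketch is only that the socket is opened once and each bulk request is a separate POST on it.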
When one reads data in Hadoop, one uses Map/Reduce or any of the libraries on top of it, which implies converting the
raw data (CSV, TSV, gzip, snappy, etc...) into Writable objects (or Tuples in Cascading and Pig, rows in Hive). This
is how Hadoop works and it's not a requirement of es-hadoop.
Some folks break down the data into:
a. fine-grained objects - e.g. the line is read as a map/multiple tokens/array
b. a single unit - e.g. the line is read as one, as a byte-array/Text, and used as is
Both cases are supported by es-hadoop. a) is the classic Map/Reduce case while b) is used when looking for performance;
typically one would either have the data directly in JSON and 'stream' it through, line by line or convert the line of
text into a JSON document and then push it to ES.
If the source is JSON [1], then es-hadoop 'streams' the data directly to ES without any processing.
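As a rough illustration of case b), here is a minimal sketch in plain Python (not es-hadoop code) that converts one line of delimited text into a single-line JSON document, which can then be passed along as is. The column names and the tab delimiter are assumptions made up for the example.

```python
import json

# Hypothetical column names; adjust them to your own data set.
FIELDS = ["user", "timestamp", "message"]


def line_to_json(line, delimiter="\t"):
    """Turn one delimited line into a compact, single-line JSON document."""
    values = line.rstrip("\n").split(delimiter)
    if len(values) != len(FIELDS):
        raise ValueError("expected %d fields, got %d" % (len(FIELDS), len(values)))
    # json.dumps escapes any embedded newlines, so the result stays on one line
    return json.dumps(dict(zip(FIELDS, values)), separators=(",", ":"))
```

In es-hadoop, input that is already JSON is flagged through the `es.input.json` setting; check the documentation for the version you are using.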
Again, the Writable usage is mandated by Hadoop, mainly to allow Mappers/Reducers to communicate with each other. There
are a couple of ideas on how to improve this (including object pooling) in future versions, but in the benchmarks we ran
this was far from being on the 'hot' list.
When doing measurements, make sure to separate the Hadoop aspect from es-hadoop. A good way of doing that is ingesting
data in a Hadoop vs a non-Hadoop environment; for example, try using Cascading on the local platform vs the Hadoop one
on your desired data set. The local mode runs really fast but can only scale to the current machine. The Hadoop one is
considerably slower on one machine; however, it scales to as many nodes as your Hadoop cluster has.
You can find such a test in our suite - see the Cascading module.
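For a quick comparison outside of that suite, a small timing helper along these lines might do. It is a sketch in plain Python; `write_fn` is a hypothetical stand-in for whichever ingestion path (local or Hadoop-backed) is being measured.

```python
import time


def measure_throughput(write_fn, docs):
    """Return the documents per second achieved by write_fn over docs."""
    docs = list(docs)  # materialize first so reading the source is not timed
    start = time.perf_counter()
    write_fn(docs)
    elapsed = time.perf_counter() - start
    return len(docs) / elapsed if elapsed > 0 else float("inf")
```

Running the same data set through both paths and comparing the two numbers gives a first idea of how much of the time is spent in Hadoop itself.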
Hope this helps,
[1] Elasticsearch Platform — Find real-time answers at scale | Elastic
On 4/17/14 11:16 PM, Igor Romanov wrote:
Hi all,
Currently I am working with elasticsearch-hadoop library with EsOutputFormat that is writing to elasticsearch,
But it looks to me like the writing is slow (elasticsearch-hadoop works with HTTP bulks on port 9200)
So my question is it worth to try to write something of my own that will use TransportClient of elasticsearch which will
write bulks to elasticsearch via tcp on 9300 ?
(I think it should reduce opening http sockets each bulk, also I will no longer write to Hadoop Context writable
objects, but directly to Elasticsearch bulks...)
Anyone had experience with that? can share it?
Thanks,
Igor
--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to
elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/410eee8e-780a-4b50-9f40-06aa5e9820c7%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
--
Costin