What are the advantages, if any, of using the es-hadoop connector instead of the ES client API (TransportClient)?
I am currently using the TransportClient and am thinking about moving to Spark. Is the easy integration of es-hadoop with Spark the only advantage, or are there more advantages that I am not aware of?
The elasticsearch-hadoop connector lets Hadoop and Elasticsearch communicate, allowing you to combine big data analytics with real-time search.
Supporting Map/Reduce, Cascading, Apache Hive, Apache Pig, Apache Spark, and Apache Storm, it lets you build a real-time big data analytics architecture such as the Lambda Architecture.
What is the Lambda Architecture?
The Lambda Architecture is split into three layers: the batch layer, the serving layer, and the speed layer.
- Apache Hadoop will serve as the batch layer for heavy computing.
- The Elasticsearch-Hadoop connector will serve as a speed layer to transport data between Hadoop and Elasticsearch.
- Elasticsearch will serve as the serving layer, which is responsible for indexing and exposing the views so that they can be queried.
There are other use cases for the es-hadoop connector that I won't list here, but to get the idea, you can consider it a transport layer between Hadoop and Elasticsearch for big data purposes.
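To make the "transport layer" role concrete, here is a minimal sketch of the kind of settings the connector reads when Spark (or another framework) talks to Elasticsearch. The host name and index/type are placeholders; consult the es-hadoop configuration reference for the full list and defaults.

```properties
# Where the Elasticsearch cluster lives (placeholder host)
es.nodes = es-host
es.port = 9200

# Which index/type to read from or write to (placeholder resource)
es.resource = myindex/mytype

# Create the target index automatically if it does not exist
es.index.auto.create = true
```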
Thanks for the reply. I didn't mean the big picture, but rather the real use cases where the es-hadoop connector has advantages over using the ES TransportClient to bulk index or query ES.
I'm not sure I get what you mean. What kind of use cases are you looking for?
Try to provide a list of what you might be thinking about, and maybe we can help you decide, for those specific use cases, whether you need the es-hadoop connector or the TransportClient!
This is covered in the [features] section in the reference docs.
The client is, for the most part, something internal to ES; it is fast, but it also has some downsides:
- it ties the client to the ES cluster: if you upgrade ES, you need to upgrade the client as well, and ideally the JVM versions should match too. With REST these issues do not occur.
- the client jar is big: an 8+ MB file that needs to be present in your classpath. When you have jobs with 50-100 tasks, having that many copies can become an issue.
- last but not least, the client comes in two flavors: node and transport. The node client is problematic, since a high-level task in Hadoop / Spark adds a spike of short-lived "clients" to the ES cluster.
The transport client avoids this to some degree; however, it goes round-robin across nodes and thus is not that efficient, and it also requires metadata to be sent across.
And one last point: ES-Hadoop/Spark is a lot more than just a transport layer. It handles not only serialization and retries but also performs all reads and writes in parallel.
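As a sketch of that parallelism, this is roughly what the es-hadoop Spark API looks like: each Spark partition writes directly to (and reads directly from) the Elasticsearch shards in parallel, instead of funneling everything through one round-robin client. This is not runnable on its own; it assumes a Spark installation, a live Elasticsearch cluster, and the elasticsearch-hadoop jar on the classpath, and the host and index names are placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._  // adds saveToEs to RDDs and esRDD to SparkContext

val conf = new SparkConf()
  .setAppName("es-hadoop-sketch")
  .set("es.nodes", "es-host")     // placeholder host
  .set("es.port", "9200")
val sc = new SparkContext(conf)

// Write: each Spark partition bulk-indexes its documents in parallel
val docs = sc.makeRDD(Seq(Map("title" -> "doc1"), Map("title" -> "doc2")))
docs.saveToEs("myindex/mytype")   // placeholder index/type

// Read: partitions are created per Elasticsearch shard and scanned in parallel
val fromEs = sc.esRDD("myindex/mytype")
```

Compare this with the TransportClient, where you would build and send bulk requests yourself from a single client round-robining over the cluster nodes.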
I totally agree with @costin; his answer adds a clear technical perspective to what I already said concerning the functional use cases.