What are the advantages, if any, of using the es-hadoop connector instead of the ES client API (TransportClient)?
I am currently using the TransportClient and am thinking about moving to Spark. Is the easy integration of es-hadoop with Spark the only advantage, or are there more advantages that I am not aware of?
The elasticsearch-hadoop connector lets Hadoop and Elasticsearch communicate, allowing you to combine big data analytics with real-time search.
Supporting Map/Reduce, Cascading, Apache Hive, Apache Pig, Apache Spark, and Apache Storm, it lets you build a real-time big data analytics architecture such as the Lambda Architecture.
What is the Lambda Architecture?
The Lambda Architecture is split into three layers: the batch layer, the serving layer, and the speed layer.
- Apache Hadoop will serve as the batch layer for heavy computing.
- The Elasticsearch-Hadoop connector will serve as a speed layer to transport data between Hadoop and Elasticsearch.
- Elasticsearch will serve as the serving layer, which is responsible for indexing and exposing the views so that they can be queried.
There are other use cases for the es-hadoop connector that I won't list here, but to get the idea, you can consider it a transport layer between Hadoop and Elasticsearch for big data purposes.
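To make the "transport layer" role concrete, here is a minimal sketch of the kind of settings the connector reads when Spark (or another framework) talks to Elasticsearch. The host name and index/type are placeholders; consult the es-hadoop configuration reference for the full list and defaults.

```properties
# Where the Elasticsearch cluster lives (placeholder host)
es.nodes = es-host
es.port = 9200

# Which index/type to read from or write to (placeholder resource)
es.resource = myindex/mytype

# Create the target index automatically if it does not exist
es.index.auto.create = true
```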
Thanks for the reply. I didn't mean the big picture, but rather the real use cases where the es-hadoop connector has advantages over using the ES TransportClient to bulk index or query ES.
I'm not sure I get what you mean. What kind of use cases are you looking for?
Try to provide a list of what you might be thinking about, and maybe we can help you decide, for those specific use cases, whether you need the es-hadoop connector or the TransportClient!
This is covered in the [features] section in the reference docs.
The client is, for the most part, something internal to ES; it is fast, but it also has some downsides:
- it ties the client to the ES cluster: if you upgrade ES, you need to upgrade the client as well, and ideally the JVM versions should match too. With REST these issues do not occur.
- the client jar is big: an 8+ MB file that needs to be present in your classpath. When you have jobs with 50-100 tasks, having that many copies can become an issue.
- last but not least, the client comes in two flavors: node and transport. The node client is problematic, since a high-level task in Hadoop / Spark adds a spike of short-lived "clients" to the ES cluster.
The transport client avoids this to some degree; however, it goes round-robin across nodes and thus is not that efficient, and it also requires metadata to be sent across.
And one last point: ES-Hadoop/Spark is a lot more than just a transport layer. It handles not only serialization and retries but also performs all reads and writes in parallel.
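As a sketch of that parallelism, this is roughly what the es-hadoop Spark API looks like: each Spark partition writes directly to (and reads directly from) the Elasticsearch shards in parallel, instead of funneling everything through one round-robin client. This is not runnable on its own; it assumes a Spark installation, a live Elasticsearch cluster, and the elasticsearch-hadoop jar on the classpath, and the host and index names are placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._  // adds saveToEs to RDDs and esRDD to SparkContext

val conf = new SparkConf()
  .setAppName("es-hadoop-sketch")
  .set("es.nodes", "es-host")     // placeholder host
  .set("es.port", "9200")
val sc = new SparkContext(conf)

// Write: each Spark partition bulk-indexes its documents in parallel
val docs = sc.makeRDD(Seq(Map("title" -> "doc1"), Map("title" -> "doc2")))
docs.saveToEs("myindex/mytype")   // placeholder index/type

// Read: partitions are created per Elasticsearch shard and scanned in parallel
val fromEs = sc.esRDD("myindex/mytype")
```

Compare this with the TransportClient, where you would build and send bulk requests yourself from a single client round-robining over the cluster nodes.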
I totally agree with @costin; his answer adds a clear technical perspective to what I already said concerning the functional use cases.