I've read through much of the documentation for es-hadoop, but I might be
coming away with some misunderstandings.
The setup docs for Elasticsearch for Apache Hadoop (es-hadoop) use the
word "interact", which is a bit vague:
Elasticsearch for Apache Hadoop is an open-source, stand-alone,
self-contained, small library that allows Hadoop jobs (whether using
Map/Reduce or libraries built upon it such as Hive, Pig or Cascading) to
interact with Elasticsearch. Data flows bi-directionally so that
applications can transparently leverage the Elasticsearch engine
capabilities to significantly enrich their capabilities and increase the
performance.
So, does this mean I have a separate Hadoop instance (potentially built
upon HDFS or AWS EMR) and can query the same data through either the
Elasticsearch (REST/Java/etc.) or Hadoop (Hive, Pig, Cascading)
environments?
I'm not sure I understand your question. es-hadoop allows Hadoop jobs
(whether they are written in Cascading/Hive/Pig/MR) to easily read from
and write to Elasticsearch (that is what "interact" means here).
es-hadoop provides native APIs for the aforementioned libraries and
underneath takes care of the boilerplate work (conversion to/from JSON,
communicating with ES, handling failures), plus adds some optimizations
such as using Hadoop's multi-node/parallel tasks.
Note that it is entirely possible to do all of this yourself (if you
don't want to, or cannot, use it).
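To make the "native APIs" point concrete, in Hive the integration looks like mapping an external table onto an Elasticsearch index via es-hadoop's storage handler. This is only a sketch: it assumes the es-hadoop jar is on Hive's classpath, and the index/type name `radio/artists` and the node address are hypothetical placeholders.

```sql
-- Map a Hive table onto an Elasticsearch index (hypothetical 'radio/artists')
CREATE EXTERNAL TABLE artists (
    id   BIGINT,
    name STRING)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES(
    'es.resource' = 'radio/artists',    -- target index/type in Elasticsearch
    'es.nodes'    = 'localhost:9200');  -- node(s) to connect to

-- Ordinary HiveQL then reads from / writes to Elasticsearch through the table:
-- INSERT OVERWRITE TABLE artists SELECT id, name FROM some_source;
-- SELECT * FROM artists;
```

es-hadoop handles the JSON conversion and ES communication behind this table definition, which is the boilerplate the answer above refers to.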