I'm trying to understand where Elasticsearch for Hadoop fits in the big
data landscape and why someone would use it.
If you want all the data in Hadoop to be searchable, doesn't that mean
all of it has to be duplicated in Elasticsearch (via es-hadoop)?
Can you push all data from Elasticsearch into Hadoop with es-hadoop,
instead of the reverse?
Here's my idea:
Short-term (1 week), real-time searchable (Kibana) data is kept in
Elasticsearch.
Long-term (1 year+), high-latency searchable (HBase, Pig et al.) data is
kept in Hadoop.
Typically one would use es-hadoop if they are already Hadoop users. As for your questions:
On duplication: yes and no. To search data one has to index it, but not necessarily store it. For convenience, so that the data is
returned along with the results, let's assume the worst-case scenario where the data is stored as well. However, one would
have to do the same even with a pure Hadoop implementation. Leaving aside the fact that one would have to write the
search algorithms using Map/Reduce, which is not at all easy (think geolocation), all the intermediate steps and keys (think
shuffling, key/output values) between input and output would be saved to disk, which results in data being duplicated on each job.
Elasticsearch aside, for data to be usable, searchable, indexed, etc., there needs to be some metadata, which is either
packed with the data or created along the way. Since you mentioned HBase and Pig, take a look at their requirements.
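To make the "index but not necessarily store" point concrete, here is a rough sketch of an index body that keeps fields searchable while telling Elasticsearch not to keep the original JSON (`_source` disabled). The field names (`message`, `location`, `timestamp`) and the helper `index_only_mapping` are made up for illustration, and the typeless mapping layout assumes a recent Elasticsearch version; older versions nest the mapping under a type name.

```python
import json


def index_only_mapping(fields):
    """Build an index body whose fields are searchable but whose
    original JSON document is not kept by Elasticsearch.

    `fields` maps a (hypothetical) field name to an Elasticsearch field type.
    """
    return {
        "mappings": {
            # Do not keep the original document; hits then carry ids, not the data.
            "_source": {"enabled": False},
            "properties": {
                name: {"type": es_type} for name, es_type in fields.items()
            },
        }
    }


# Hypothetical fields, chosen only to mirror the thread (text search, geolocation).
body = index_only_mapping(
    {"message": "text", "location": "geo_point", "timestamp": "date"}
)
print(json.dumps(body, indent=2))
```

With a mapping along these lines, a search returns matching document ids, and the full records would still have to be fetched from wherever they live in Hadoop, which is the trade-off against the "store everything" worst case described above.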
On direction: yes, es-hadoop is bidirectional, so one can stream data into ES from HDFS, for example, or stream data from ES to HDFS.
However, while ES can be used as a store, it's much more valuable if you use it for its search/insight capabilities, which is
why typically one would read search results from ES, not just raw data.
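As an illustration of reading search results rather than raw data, the sketch below builds an es-hadoop read configuration that carries a query, then (in a function that is defined but not called here) hands it to PySpark and writes the hits to HDFS. The property names `es.nodes`, `es.resource` and `es.query` come from the es-hadoop configuration; the index name `logs`, the `message` field, the host, and both helper functions are assumptions for the example. The Spark part needs a running cluster with the es-hadoop jar on the classpath and has not been verified against one; older es-hadoop versions may also expect `index/type` as the resource.

```python
import json


def es_read_conf(nodes, resource, query):
    """Build es-hadoop settings that read only the documents matching `query`."""
    return {
        "es.nodes": nodes,               # Elasticsearch host(s) to connect to
        "es.resource": resource,         # index to read from (hypothetical name below)
        "es.query": json.dumps(query),   # query DSL, passed as a JSON string
    }


def export_hits_to_hdfs(sc, conf, hdfs_path):
    """Stream matching documents from ES into HDFS as JSON lines.

    `sc` is an existing SparkContext; requires the es-hadoop jar on the classpath.
    """
    rdd = sc.newAPIHadoopRDD(
        inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
        keyClass="org.apache.hadoop.io.NullWritable",
        valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
        conf=conf,
    )
    # Each record is a (document id, document fields) pair; keep the fields.
    rdd.map(lambda kv: json.dumps(kv[1], default=str)).saveAsTextFile(hdfs_path)


# Hypothetical host, index and query, for illustration only.
conf = es_read_conf(
    "localhost:9200",
    "logs",
    {"query": {"match": {"message": "error"}}},
)
print(conf)
```

Only the documents matching the query leave Elasticsearch, so the Hadoop job starts from search results instead of scanning the whole index.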
If you haven't seen it yet, I recommend the latest webinar [1], which features es-hadoop and provides a complete
picture of what es-hadoop is.