Understanding Elasticsearch-Hadoop

Hi,

Even after going through many resources and reading about es-hadoop, I am
unable to resolve some of my doubts:

How do I run Elasticsearch data nodes on my Hadoop data nodes?
Can I install an Elasticsearch cluster and store its indexes on Hadoop HDFS?
If yes, then how?
Will I have to keep two copies of the data, one in Hadoop and one in
Elasticsearch?

PS: I have also gone through this URL:
https://github.com/elastic/elasticsearch/issues/9072

Please answer or point to some useful resources that explain the
architecture of setting up ES and Hadoop together in tightly coupled mode. To
me, the official es-hadoop documentation on the Elastic website does not
provide a good understanding.

Regards,
Bharvi Dixit

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/17d2deea-b750-4b2b-ba55-098aebb9f31f%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Hi,

Hadoop means a lot of things as it has a lot of components. I'm sorry to hear the resources you read don't give you
enough answers.

The 'definition' of Elasticsearch Hadoop is given in the documentation preface [1] which I quote below:

"
Elasticsearch for Apache Hadoop is an ‘umbrella’ project consisting of three similar, yet independent sub-projects with
their own, dedicated, section in the documentation:

  • Elasticsearch on YARN
    run Elasticsearch on top of YARN - see Elasticsearch on YARN
  • repository-hdfs
    use HDFS as a repository back-end; that is storage for doing snapshot/restore from/to Elasticsearch. For more
    information refer to its home page
  • elasticsearch-hadoop proper
    interact with Elasticsearch from within a Hadoop environment. If you are using Map/Reduce, Cascading, Hive, Pig,
    Apache Spark or Apache Storm, this project is for you.
    "

Note that none of these definitions match your questions. That is a good indicator that using Elasticsearch for Apache
Hadoop won't address your issues.
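
For contrast, here is roughly what elasticsearch-hadoop proper is for: a Hadoop-side job (Spark in this sketch) reading from
a separately running Elasticsearch cluster. Treat it as a minimal sketch rather than a reference setup. The host names, the
"logs/event" index/type and the query are made up for illustration; the `es.*` keys are the connector's configuration
settings, and the commented-out part assumes PySpark with the elasticsearch-hadoop jar on the classpath.

```python
def es_hadoop_conf(nodes, resource, port=9200, query=None):
    """Build the string-valued settings the connector reads from the job configuration."""
    conf = {
        "es.nodes": nodes,        # comma-separated Elasticsearch hosts (hypothetical here)
        "es.port": str(port),     # Hadoop configuration values are strings
        "es.resource": resource,  # index/type to read from or write to
    }
    if query is not None:
        conf["es.query"] = query  # optional query to restrict what is read
    return conf


conf = es_hadoop_conf("es-host-1,es-host-2", "logs/event", query="?q=status:error")
print(conf)

# Assumed usage from PySpark (needs a SparkContext `sc` and the connector jar):
# rdd = sc.newAPIHadoopRDD(
#     inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
#     keyClass="org.apache.hadoop.io.NullWritable",
#     valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
#     conf=conf,
# )
```

Note that nothing here places Elasticsearch data on HDFS or co-locates nodes; the job simply talks to Elasticsearch over
the network.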

How do I run Elasticsearch data nodes on my Hadoop data nodes?

Do you have a certain expectation in mind? Simply install Elasticsearch on those machines and start it.

Can I install an Elasticsearch cluster and store its indexes on Hadoop HDFS? If yes, then how?

You can, but it is not recommended, and the issue you linked explains why. If you want to store your indexes on HDFS
(again, make sure you understand what you're doing), you can mount HDFS as a local partition through its NFS gateway.
This is not Elasticsearch-specific; the entire functionality is provided by Hadoop 2.x.
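
To make the NFS route concrete, here is a small sketch that only assembles the two pieces involved: the mount command for
the HDFS NFS gateway and the Elasticsearch `path.data` setting pointing at the mount. The gateway host, mount point and
sub-directory are hypothetical, and the mount options are the ones I recall from the Hadoop NFS gateway documentation, so
check them against your Hadoop version. It prints the strings and does not run anything.

```python
def nfs_mount_command(gateway_host, mount_point):
    """Return the mount command (as an argument list) for an HDFS NFS gateway export."""
    return [
        "mount", "-t", "nfs",
        "-o", "vers=3,proto=tcp,nolock",  # the gateway speaks NFSv3 over TCP
        f"{gateway_host}:/",              # the gateway exports the HDFS root
        mount_point,
    ]


def path_data_setting(mount_point, subdir="elasticsearch"):
    """Return the elasticsearch.yml line that puts index data on the mounted path."""
    return f"path.data: {mount_point}/{subdir}"


cmd = nfs_mount_command("nfs-gateway.example.com", "/mnt/hdfs")
print(" ".join(cmd))
print(path_data_setting("/mnt/hdfs"))
```

The caveat above still applies: this makes HDFS look like a local disk to Elasticsearch, which is exactly the setup the
linked issue advises against.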

Will I have to keep two copies of the data, one in Hadoop and one in Elasticsearch?

Any engine that works with data relies on its own data structures in order to understand and work with the data stored
in it. This is a generic concept that applies to all storage solutions out there, including Hadoop/HDFS: if you don't
have any 'schema' associated with your data, you need to create one at runtime, and things like indexes are either
stored on disk separately or created at runtime.

In other words, you will have some metadata besides your raw data; there's no going around it. Depending on how
fine-grained your data is, this overhead can be smaller or bigger. Additionally, you can save the data in Elasticsearch
(typically a good idea for aggregations and such) or not: you can choose to save only an id or pointer to the raw data.
But then, every time you want to read the data, Elasticsearch won't have it, so it will be up to you to retrieve it based
on the id/pointer you stored in Elasticsearch. This is not only complicated but significantly slower (it's the typical
n+1 problem: 1 search/call returns N results, and for each one you have to make an additional call).
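
The n+1 cost can be illustrated with a toy model that counts round trips instead of doing real I/O. Everything in it (the
`Backend` class, the sample documents) is made up for the illustration; it does not use the Elasticsearch or HDFS APIs.

```python
RAW = {1: "error: disk full", 2: "ok", 3: "error: timeout"}


class Backend:
    """Stand-in for a remote system; every method call counts as one round trip."""

    def __init__(self, docs):
        self.docs = dict(docs)
        self.calls = 0

    def search_ids(self, term):
        self.calls += 1
        return [doc_id for doc_id, text in self.docs.items() if term in text]

    def search_with_source(self, term):
        self.calls += 1
        return [(doc_id, text) for doc_id, text in self.docs.items() if term in text]

    def fetch(self, doc_id):
        self.calls += 1
        return self.docs[doc_id]


# Option A: the index stores only ids/pointers; the raw data lives elsewhere.
index, raw_store = Backend(RAW), Backend(RAW)
ids = index.search_ids("error")                          # 1 call
pointer_results = [raw_store.fetch(i) for i in ids]      # N further calls
pointer_calls = index.calls + raw_store.calls

# Option B: the index also stores the data, so one search returns everything.
index_b = Backend(RAW)
stored_results = [text for _, text in index_b.search_with_source("error")]
stored_calls = index_b.calls

print(pointer_calls, stored_calls)
```

With N matching documents, the pointer-only approach needs 1 + N calls, while storing the data in the index needs one.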

Last but not least, there's a presentation by yours truly from Elasticon that gives a tour of Elasticsearch Hadoop.
Unfortunately the video is not available but you can find the slides here [2]

[1] Preface | Elasticsearch for Apache Hadoop [master] | Elastic
[2] https://www.elastic.co/elasticon/2015/sf/elasticsearch-hadoop-friends-spark-storm-and-more

In reply to Bharvi Dixit's message of 4/4/15 8:55 PM.

--
Costin
