How are Hadoop and ES typically used together?

Would I store massive amounts of data in a Hadoop cluster and then use the ES plugin to index the data into ES, but with the _source field disabled in ES?

Then I would query ES, which returns a bunch of document ids, and if I want the actual documents, I get them from Hadoop?

Typically one would index the data in Elasticsearch and run the queries there. You can disable _source and keep only an index, but you don't really want to do that since performance will suffer significantly. Based on your queries, Elasticsearch will know which data matches, but since it doesn't have the data, it will only hold some type of pointer/UUID that you define to where the data is actually located.
For each match, Elasticsearch will give you the pointer and you'll have to get the data yourself. This is the classic N+1 problem: 1 call (the query to Elasticsearch) returns N results, which in turn require N more calls, in this case to Hadoop/HDFS.
Furthermore, HDFS is not fast at this kind of per-document lookup, and each call is likely to go over the network.
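
To make the N+1 pattern concrete, here is a minimal sketch in Python. It does not talk to Elasticsearch or HDFS at all; FakeIndex and FakeHdfs are hypothetical in-memory stand-ins, used only to count the round trips that such a setup would incur.

```python
from dataclasses import dataclass


@dataclass
class FakeIndex:
    """Hypothetical stand-in for an index with _source disabled: it only knows ids."""
    postings: dict  # term -> list of document ids

    def search(self, term):
        # One call: returns pointers (ids) only, never the documents themselves.
        return list(self.postings.get(term, []))


@dataclass
class FakeHdfs:
    """Hypothetical stand-in for the external store holding the raw documents."""
    docs: dict
    calls: int = 0

    def fetch(self, doc_id):
        # Each fetch models a separate (typically remote) round trip.
        self.calls += 1
        return self.docs[doc_id]


def search_and_fetch(index, store, term):
    ids = index.search(term)                # 1 call to the index
    return [store.fetch(i) for i in ids]    # N calls to the store


index = FakeIndex({"error": ["a", "b", "c"]})
store = FakeHdfs({"a": "error: disk", "b": "error: net", "c": "error: cpu"})
results = search_and_fetch(index, store, "error")
print(len(results), store.calls)
```

With three matches, the single search is followed by three separate fetches; with _source enabled, the documents would come back with the search response itself.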

Elasticsearch is quite efficient at compressing data, and if you have information that's not required, you can leave it out. Furthermore, with the information available, Elasticsearch can apply aggregations, that is, introspect the data automatically.
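
As a rough illustration of those options, the request bodies below are written as Python dicts. The field names (message, status, raw_payload) are made up for the example; the "_source" and "aggs" keys are Elasticsearch syntax. The mappings are shown in the typeless form used by recent versions; older versions nest them under a type name.

```python
import json

# Index settings that drop _source entirely (search works, documents are not returned).
mapping_without_source = {
    "mappings": {
        "_source": {"enabled": False},
        "properties": {"message": {"type": "text"}},
    }
}

# Index settings that keep _source but leave out a field that is not required.
mapping_trimmed_source = {
    "mappings": {
        "_source": {"excludes": ["raw_payload"]},
        "properties": {
            "message": {"type": "text"},
            "status": {"type": "keyword"},
        },
    }
}

# A search body that returns no hits, only a count of documents per status value.
terms_aggregation = {
    "size": 0,
    "aggs": {"by_status": {"terms": {"field": "status"}}},
}

for body in (mapping_without_source, mapping_trimmed_source, terms_aggregation):
    print(json.dumps(body, indent=2))
```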

Do note that pretty much every engine requires the raw data to be transformed into its own format; otherwise, for every job one would have to recreate the index/fast format, which is computationally expensive. Disk, on the other hand, is significantly cheaper.
And in the case of Elasticsearch, data can be easily partitioned into indices; each can be snapshotted and later restored (imported) very quickly without doing any reindexing. In other words, you have plenty of means to move data in and out of your Elasticsearch cluster, with or without reindexing.
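
A small sketch of what snapshotting and restoring a single index looks like at the REST level. The helpers only build the method, path and body of each request and send nothing. The repository, snapshot and index names are hypothetical, and the repository is assumed to be registered already.

```python
def snapshot_request(repo, snapshot, indices):
    """Build the request that snapshots the given indices into a repository."""
    return ("PUT", f"/_snapshot/{repo}/{snapshot}", {"indices": ",".join(indices)})


def restore_request(repo, snapshot, indices):
    """Build the request that restores the given indices from a snapshot."""
    return (
        "POST",
        f"/_snapshot/{repo}/{snapshot}/_restore",
        {"indices": ",".join(indices)},
    )


# Hypothetical names, for illustration only.
backup = snapshot_request("my_backup", "snap_1", ["logs-2015-01"])
restore = restore_request("my_backup", "snap_1", ["logs-2015-01"])
print(backup)
print(restore)
```

Because a snapshot stores the index files themselves, restoring does not run the documents through indexing again.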

So the way I read this is: if I already have a Hadoop cluster, I can index its data into ES and then use ES for searching and analytics.

I'm starting fresh here so from my point of view I don't really need Hadoop then...

Yes. Note that besides the docs, there are also some webinars (like this one) on the topic that you might want to look at.

Is storing the data only in Elasticsearch all right?
Can you do all the distributed data processing that you do with Hadoop?

@shubhamgupta1404 Not sure what you are asking. To learn more about what Elasticsearch is, simply take a look at - there are tons of docs/blogs/webinars on what it does.

I meant to ask: if you store the data on HDFS as well as on an Elasticsearch cluster, won't that duplicate the data and thereby use a lot of extra space?

Further, what are the disadvantages/advantages of storing the data only in Elasticsearch?

The data in HDFS is in raw format - the data in ES is indexed and thus can be searched/analyzed.

Note that in ES, one can select what to store and thus remove the bits that are not needed.
Also note that HDFS assumes disk space is "infinite": during Map/Reduce computing jobs a lot of shuffling takes place, and thus one can end up with several times the amount of inserted data as "temporary" data stored on HDFS.
It varies widely, but my point is that in Hadoop it's a safe bet to keep 50% or more free space, depending on your job; otherwise you might run into trouble.
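
As a back-of-the-envelope sketch of that rule of thumb: the function below is hypothetical, and the replication factor of 3 is an assumption (it is the HDFS default, but your cluster may differ). The 50% free fraction is the figure mentioned above.

```python
def hdfs_capacity_needed(raw_tb, replication=3, free_fraction=0.5):
    """Total disk (TB) needed so that `free_fraction` of the cluster stays free."""
    if not 0 <= free_fraction < 1:
        raise ValueError("free_fraction must be in [0, 1)")
    stored_tb = raw_tb * replication          # space taken by the replicated blocks
    return stored_tb / (1 - free_fraction)    # scale up to leave the headroom


# 10 TB of raw data, 3 replicas, half of the cluster kept free.
print(hdfs_capacity_needed(10))
```

Under these assumptions, 10 TB of raw data already calls for 60 TB of disk before any copy is indexed elsewhere.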

As for your second question, that's really for you to answer; ES has no issue being either the primary or secondary source of data. ES, however, is not advertised as a pure data store, since this "feature" is secondary to its purpose, namely search and analytics.