Typically one would index the data in Elasticsearch and run the queries there. You can disable `_source` and keep only the index, but you don't really want to do that, since performance will suffer significantly. Based on your queries, Elasticsearch will know what data matches, but since it doesn't have the data itself, it will only hold some type of pointer / UUID that you define, pointing to where the data is actually located.
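As a sketch of what that pointer-only setup would look like (index and field names here are placeholders, not anything you must use), the mapping would disable `_source` and store just an identifier:

```json
PUT my-index
{
  "mappings": {
    "_source": { "enabled": false },
    "properties": {
      "pointer": { "type": "keyword" }
    }
  }
}
```

With `_source` disabled, a search can still match and return the `pointer` field (if stored or doc-values backed), but the original document can no longer be retrieved from Elasticsearch itself.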
For each match, Elasticsearch will give you the pointer and you'll have to fetch the data yourself. This is the classical N+1 problem: for 1 call (the query to Elasticsearch) you end up with N results, which in turn trigger N more calls, in this case to Hadoop/HDFS.
Furthermore, HDFS is not fast, and each of those calls is likely to go over the network.
Elasticsearch is quite efficient at compressing data, and if you have information that's not required, you can simply leave it out. Furthermore, with the data available locally, Elasticsearch can apply aggregations, that is, introspect the data automatically.
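For example (field names are illustrative only), keeping the data in Elasticsearch lets you run aggregations directly in one request, with no round trips to external storage:

```json
POST logs/_search
{
  "size": 0,
  "aggs": {
    "by_status": { "terms": { "field": "status" } },
    "avg_latency": { "avg": { "field": "latency_ms" } }
  }
}
```

None of this is possible if the cluster holds only pointers to data living in HDFS.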
Do note that pretty much every engine requires the raw data to be transformed into its own format - otherwise, for every job, one would have to recreate the index/fast format, which is computationally expensive. Disk, on the other hand, is significantly cheaper.
And in the case of Elasticsearch, data can easily be partitioned into indices; each can be snapshotted and later restored (imported) very quickly, without any reindexing. In other words, you have plenty of means to move data in and out of your Elasticsearch cluster, with or without reindexing.
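As a rough sketch (assuming a snapshot repository named `my_backup` has already been registered, and using placeholder index/snapshot names), snapshot and restore look like this:

```json
PUT _snapshot/my_backup/snapshot_1?wait_for_completion=true
{
  "indices": "logs-2016-01"
}

POST _snapshot/my_backup/snapshot_1/_restore
{
  "indices": "logs-2016-01"
}
```

The restore copies the already-built index segments back into the cluster, which is why no reindexing is needed.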