Let's say you have 1,000 data files, where a file can be a Word document, a PDF, a PowerPoint, a text file, etc. You can store these files in the local file system, in the Hadoop Distributed File System (HDFS), as a BLOB in a relational database, in Mongo GridFS, and so on. When you need to search that data, you index these files: you either develop an application or use an existing tool to read the files, turn them into text, and then index the text with Elasticsearch (ES). The searchable data is stored in ES so you can search for information. You also have the option to store the "text" content (not the binary content such as the Word document or PDF itself) in ES, so that when you perform a search you can retrieve the "text" content and display it as part of the search result if you like. Storing the "text" content in ES makes the index bigger; in some cases it's worth storing the content there and in some cases it's not. That's something you can decide based on your data and what you want to do with it.
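To make that concrete, here is a minimal sketch of what indexing one file's extracted text into ES could look like, using the Java high level REST client as an example. The index name "documents", the field names, and the extractedText value are just placeholders I made up for this illustration:

```java
import org.apache.http.HttpHost;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;

public class IndexOneFile {
    public static void main(String[] args) throws Exception {
        // Connect to a local Elasticsearch node (adjust host/port for your cluster).
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {

            // Pretend this is the text pulled out of a Word/PDF file.
            String extractedText = "...text extracted from report.pdf...";

            // Store the file name and the extracted text as a document
            // in a hypothetical "documents" index.
            IndexRequest request = new IndexRequest("documents")
                    .id("report.pdf")
                    .source("filename", "report.pdf",
                            "content", extractedText);

            client.index(request, RequestOptions.DEFAULT);
        }
    }
}
```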
With only a few files, you don't need HDFS; it isn't worth the money and effort to set up a Hadoop cluster just to store a handful of files, so use your local file system. When you have lots of files, meaning something in the millions or billions or more, and you know your dataset will grow over time, then HDFS is one of many options you can choose. With HDFS, if you run out of disk space you can easily add one or more data nodes to the Hadoop cluster to expand the storage, and you can keep growing it that way. With a local file system, you'll need to add more physical drives, and there is a limit on how many physical drives you can add to a single machine.
How should these files be stored in HDFS? Let's save that for a different discussion.
When you store files in HDFS with the intention of processing them (indexing is just one of many reasons), you'll need to develop a process or application that can retrieve the files from HDFS, extract the content and any available metadata, and then index them with ES. Elasticsearch-Hadoop (ES-Hadoop) can help with this, or you can write your own app if you know what to do.
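Here is a rough sketch of the retrieve-and-extract step, assuming the Hadoop FileSystem API for reading from HDFS and Apache Tika (one example of the open source extraction libraries mentioned further down) for turning the binary file into text; the path /data/docs/report.pdf is made up for the example:

```java
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.tika.Tika;
import org.apache.tika.metadata.Metadata;

public class ExtractFromHdfs {
    public static void main(String[] args) throws Exception {
        // Hadoop picks up fs.defaultFS from core-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Example path; in a real job this would come from a file listing.
        Path file = new Path("/data/docs/report.pdf");

        Tika tika = new Tika();
        Metadata metadata = new Metadata();

        try (InputStream in = fs.open(file)) {
            // Tika detects the file type (Word, PDF, ...) and returns plain text,
            // filling in whatever metadata it can find along the way.
            String text = tika.parseToString(in, metadata);
            System.out.println("Extracted " + text.length() + " characters from " + file);
            System.out.println("Content-Type: " + metadata.get("Content-Type"));
        }
    }
}
```

The extracted text and metadata are what you would then send to ES, along the lines of the earlier indexing sketch.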
For example, you can develop a normal command-line application to do this, or you can develop a MapReduce job to do the same thing.
A command-line application is straightforward: it recursively walks the files in a directory and processes (or indexes) them one at a time. When you have a billion files, it will take a long time for a single command-line application to process them all. That is when you need to develop a MapReduce job so you can cut the processing time by processing files in parallel: the job is split into tasks distributed across the data nodes in the cluster, with each task processing some number of files under the same input directory in HDFS. This MapReduce job needs HDFS.
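For the command-line flavor, the recursive walk can be as simple as the sketch below. It works against the local file system and leaves the actual extract-and-index step as a placeholder:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class IndexDirectory {
    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : "/data/docs");

        // Recursively visit every regular file under the directory
        // and process them one at a time -- simple, but single-threaded.
        try (Stream<Path> paths = Files.walk(root)) {
            paths.filter(Files::isRegularFile)
                 .forEach(IndexDirectory::indexOneFile);
        }
    }

    private static void indexOneFile(Path file) {
        // Placeholder: extract the text (e.g. with Tika) and send it to ES,
        // along the lines of the earlier sketches.
        System.out.println("Would index: " + file);
    }
}
```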
If you need a MapReduce job to index your data with ES, Elasticsearch-Hadoop can help you with that and will save you development time. The only part you still have to handle yourself is extracting the text content out of binary files like Word documents, PDFs, etc. There are many open source libraries that can do that for you, so you don't need to write your own.
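To give a feel for how ES-Hadoop fits in, here is a rough sketch of a map-only job that uses ES-Hadoop's EsOutputFormat, so the mapper only builds the document and the library takes care of writing to ES. The ES host, the "documents/_doc" target, the use of Tika for extraction, and the assumption that the job's input is a text file listing one HDFS path per line are all choices made for this example, not the only way to do it:

```java
import java.io.IOException;
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.tika.Tika;
import org.elasticsearch.hadoop.mr.EsOutputFormat;

public class EsIndexJob {

    // Input: a text file in HDFS where each line is the path of one file to index.
    // Output: one ES document per file, written by ES-Hadoop's EsOutputFormat.
    public static class FileToDocMapper
            extends Mapper<LongWritable, Text, NullWritable, MapWritable> {

        private final Tika tika = new Tika();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            Path file = new Path(value.toString().trim());
            FileSystem fs = file.getFileSystem(context.getConfiguration());

            String text;
            try (InputStream in = fs.open(file)) {
                text = tika.parseToString(in);       // binary file -> plain text
            } catch (Exception e) {
                return;                              // skip files Tika can't handle
            }

            MapWritable doc = new MapWritable();
            doc.put(new Text("filename"), new Text(file.getName()));
            doc.put(new Text("content"), new Text(text));
            context.write(NullWritable.get(), doc);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("es.nodes", "localhost:9200");      // your ES cluster
        conf.set("es.resource", "documents/_doc");   // target index
        conf.setBoolean("mapreduce.map.speculative", false);

        Job job = Job.getInstance(conf, "index-files-with-es-hadoop");
        job.setJarByClass(EsIndexJob.class);
        job.setMapperClass(FileToDocMapper.class);
        job.setNumReduceTasks(0);                    // map-only job
        job.setOutputFormatClass(EsOutputFormat.class);
        job.setMapOutputValueClass(MapWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

You would package this with the elasticsearch-hadoop and Tika jars on the job's classpath and point it at the path-listing file as the input; ES-Hadoop then handles batching and sending the documents to the cluster.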