How to index a large file with Elasticsearch

Hi,

I have a large file saved in PDF format in Hadoop. I want to index the file contents and other file information. I was looking around the web and finally stumbled on this here.

But there they index the file by encoding it with Base64, saving the result to a JSON document, and creating the index from that JSON file.

Now, what if my file is, say, 1 GB in size? Wouldn't it take a long time to Base64-encode the file (without using Hadoop) and then create an index from it?

So my question is: what is the most efficient way to index a large file stored in Hadoop with Elasticsearch?

As you are probably coding in Java, you should read the file from Hadoop and send it to Apache Tika to extract the text you want to index.
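Something along these lines, as a minimal sketch of that first step. The HDFS URI, the file path, and the class name are placeholders, not code from your project:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.tika.Tika;

import java.io.InputStream;
import java.net.URI;

public class HdfsTikaExtract {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder namenode URI and file path
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        Path pdf = new Path("/data/report.pdf");

        Tika tika = new Tika();
        // Disable Tika's default string length limit for large documents
        tika.setMaxStringLength(-1);

        try (InputStream in = fs.open(pdf)) {
            // Streams the PDF from HDFS and returns only the plain text
            String text = tika.parseToString(in);
            System.out.println("Extracted " + text.length() + " characters");
        }
    }
}
```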

Then send this extracted text (and only that) to Elasticsearch.
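For example, with the Java high-level REST client it could look roughly like this. The index name `files`, the field names, the host, and the hard-coded path are assumptions for illustration; the extracted text is the output of the Tika step above:

```java
import org.apache.http.HttpHost;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;

import java.util.HashMap;
import java.util.Map;

public class IndexExtractedText {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {

            String extractedText = "..."; // plain text produced by Tika

            Map<String, Object> doc = new HashMap<>();
            doc.put("path", "hdfs://namenode:8020/data/report.pdf"); // file metadata
            doc.put("content", extractedText);                       // text only, no Base64

            // Index just the extracted text and metadata, not the raw PDF bytes
            IndexRequest request = new IndexRequest("files").source(doc);
            client.index(request, RequestOptions.DEFAULT);
        }
    }
}
```

That way the 1 GB PDF never has to be Base64-encoded or shipped to Elasticsearch; only the text you actually want to search gets indexed.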

My 2 cents.