Hi,
I have a large file saved in PDF format in Hadoop. I want to index the file's contents along with other information about the file. I was looking around the web and finally stumbled on this here.
But in that example they index the file by encoding it with Base64, saving the result to a JSON file, and creating the index from that JSON file.
Now, what if my file is, say, 1 GB in size? Wouldn't it take a long time to Base64-encode the file without using Hadoop and then create an index on it?
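To make the concern concrete, here is a minimal sketch of the encode-then-JSON step as I understand it. The helper name `build_attachment_doc` and the field names `file_name` and `content` are my own placeholders, not taken from the page I found, and the sample bytes stand in for a real PDF.

```python
import base64
import json


def build_attachment_doc(file_name: str, raw_bytes: bytes) -> str:
    """Return a JSON string holding the file name and its Base64-encoded content."""
    # Base64 turns arbitrary bytes into ASCII text so they can sit inside JSON.
    encoded = base64.b64encode(raw_bytes).decode("ascii")
    # "file_name" and "content" are hypothetical field names for illustration.
    return json.dumps({"file_name": file_name, "content": encoded})


# Small stand-in for the real PDF bytes.
doc = build_attachment_doc("report.pdf", b"%PDF-1.4 sample bytes")
```

Written this way, the whole file and its encoded copy are held in memory at once, and Base64 output is about a third larger than the input, so a 1 GB PDF would become roughly 1.33 GB of text inside the JSON document.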
So my question is: what is the most efficient way to index a large file in Hadoop with Elasticsearch?