I have an HDFS archive that stores a variety of documents such as PDF, MS Word, PPT, and CSV files. I would like to build a platform using Elasticsearch to search the files or their text contents. I know I can use the es-hadoop plugin to index data from HDFS to ES. I want to know the best way to extract the textual data from the docs stored in HDFS and index it.
So what is the difference between the ingest plugin and the mapper attachments plugin? I was looking at these as a possible solution. Can either of them be pointed at an HDFS location? I don't want to store the files in ES. The files should remain on HDFS, and only the searchable contents should be moved to ES.
I may have caused some confusion, but Apache Tika is used by ManifoldCF.
You can decide for yourself whether or not you want to use Tika through the Mapper-Attachments plugin (https://github.com/elastic/elasticsearch-mapper-attachments).
In fact, it is quite easy to set up and does not need advanced configuration.
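To make the "keep the files on HDFS, index only the text" idea concrete, here is a minimal sketch. It uses the ingest attachment processor (from the ingest-attachment plugin) rather than Mapper-Attachments, so treat it as one possible approach, not the setup described above. The field names `data` and `hdfs_path` and the example path are assumptions made up for illustration, and reading the bytes from HDFS is left out. The code only builds the JSON bodies and does not talk to a cluster.

```python
import base64
import json


def build_pipeline():
    """Ingest pipeline body: extract text with the attachment processor,
    then remove the base64 payload so the file itself is not kept in ES."""
    return {
        "description": "Extract text, discard the raw file",
        "processors": [
            # "data" is an assumed field name holding the base64-encoded file.
            {"attachment": {"field": "data"}},
            {"remove": {"field": "data"}},
        ],
    }


def build_document(file_bytes, hdfs_path):
    """Document body: the file content (base64) plus a pointer back to HDFS.
    "hdfs_path" is a hypothetical field, used only to locate the original."""
    return {
        "data": base64.b64encode(file_bytes).decode("ascii"),
        "hdfs_path": hdfs_path,
    }


if __name__ == "__main__":
    print(json.dumps(build_pipeline(), indent=2))
    print(json.dumps(build_document(b"hello", "hdfs:///archive/report.pdf")))
```

The pipeline body would be sent with `PUT _ingest/pipeline/<name>`, and each document indexed with `?pipeline=<name>`. The extracted text then ends up under `attachment.content`, while the original file stays on HDFS and is only referenced by its path.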