Extracting data from documents stored in HDFS to index in Elasticsearch

(Sachin Shaju) #1

I have an HDFS archive that stores a variety of documents: PDF, MS Word, PPT, CSV, etc. I would like to build a platform using Elasticsearch to search the files or their text contents. I know I can use the es-hadoop plugin to index data from HDFS to ES. I want to know the best way to extract the textual data from the documents stored in HDFS and index it.

Any help would be appreciated.

(Costin Leau) #2

Take a look at the various plugins for elasticsearch. This is about content extraction and that's where the plugins help a lot.
In particular ES 5.0 comes with a dedicated ingest plugin based on Apache Tika:
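
As a rough illustration of the ingest approach, here is a minimal Python sketch that only builds the JSON bodies involved. It assumes the Tika-based plugin meant here is the ingest attachment plugin, whose `attachment` processor reads a base64-encoded field, and it adds a `remove` processor so the raw bytes are dropped after extraction. The field names `data` and `hdfs_path` and both helper functions are hypothetical, and reading the file from HDFS is not shown.

```python
import base64
import json


def build_attachment_pipeline(source_field="data"):
    """Build an ingest pipeline body (assumed layout, see lead-in).

    The attachment processor extracts text from the base64 field; the
    remove processor then deletes that field so the file itself is not
    kept in the index.
    """
    return {
        "description": "Extract text from a base64 field, then drop the raw bytes",
        "processors": [
            {"attachment": {"field": source_field}},
            {"remove": {"field": source_field}},
        ],
    }


def build_document(raw_bytes, hdfs_path, source_field="data"):
    """Build the document to send through the pipeline.

    hdfs_path is a hypothetical field that records where the original
    file still lives on HDFS.
    """
    return {
        source_field: base64.b64encode(raw_bytes).decode("ascii"),
        "hdfs_path": hdfs_path,
    }


if __name__ == "__main__":
    pipeline = build_attachment_pipeline()
    doc = build_document(b"example file bytes", "hdfs:///archive/report.pdf")
    print(json.dumps(pipeline, indent=2))
    print(json.dumps(doc, indent=2))
```

The pipeline body would be registered under `_ingest/pipeline/<id>` and the document indexed with `?pipeline=<id>`. The file still has to be read from HDFS and sent to ES for extraction, but only the extracted text and the path end up stored.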

(Sachin Shaju) #3

So what is the difference between the ingest plugin and the mapper attachments plugin? I was looking at the mapper attachments plugin as a solution. Can either of them point to an HDFS location? I don't want to store the files in ES. The files should remain on HDFS; only the searchable content should be moved to ES.

(Sachin Shaju) #4

I haven't seen any proper documentation for the ingest plugin, and only an alpha version of ES 5.0 has been released so far. So what would you suggest for my use case at this time? I have an HDFS archive with many types of documents and want to extract all the information into ES while keeping the documents in Hadoop. Please take a look at http://stackoverflow.com/questions/36419608/extracting-data-from-documents-stored-in-hdfs-to-index-in-elasticsearch/36436556#36436556 for detailed info.


You might want to look into ManifoldCF.

Just point it to your HDFS repository and define ES as the output; it uses Apache Tika as well.

(Sachin Shaju) #6

Thanks for the reply! Can we use Tika as is to extract from HDFS? Which one would be the better choice?


I may have caused some confusion: Apache Tika is used by ManifoldCF.
You can decide for yourself whether you also want to use Tika through the Mapper Attachments plugin (https://github.com/elastic/elasticsearch-mapper-attachments).
In fact, it is quite easy to set up and does not need advanced configuration.
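
For the route where text is extracted outside Elasticsearch (by Tika or a crawler such as ManifoldCF), only the extracted text and a pointer back to HDFS need to be indexed. The sketch below assumes the text has already been extracted and just formats it as a newline-delimited bulk request body. The function name, the `hdfs_path` and `content` fields, and the choice of the HDFS path as document ID are all hypothetical; the target index is expected to come from the request URL.

```python
import json


def to_bulk_body(docs):
    """Format (hdfs_path, extracted_text) pairs as a bulk request body.

    Each document becomes two JSON lines: an action line and a source
    line. The file stays on HDFS; only its path and text are sent.
    """
    lines = []
    for hdfs_path, text in docs:
        # Action line: use the HDFS path as the ID so re-indexing overwrites.
        lines.append(json.dumps({"index": {"_id": hdfs_path}}))
        # Source line: the searchable content plus a pointer to the file.
        lines.append(json.dumps({"hdfs_path": hdfs_path, "content": text}))
    if not lines:
        return ""
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    body = to_bulk_body([
        ("hdfs:///archive/report.pdf", "Quarterly report text"),
        ("hdfs:///archive/notes.docx", "Meeting notes text"),
    ])
    print(body, end="")
```

Using the HDFS path as the ID is one possible design choice: a search hit then maps straight back to the file, and re-running the extraction replaces the old entry instead of duplicating it.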
