Extracting data from documents stored in HDFS to index in Elasticsearch


(Sachin Shaju) #1

I have an HDFS archive that stores a variety of documents such as PDF, MS Word, PPT, and CSV files. I would like to build a platform using Elasticsearch to search the files or their text contents. I know I can use the es-hadoop plugin to index data from HDFS into ES. I want to know the best way to extract the textual data from the docs stored in HDFS and index it.

Any help would be appreciated.


(Costin Leau) #2

Take a look at the various plugins for Elasticsearch. This is about content extraction, and that's where the plugins help a lot.
In particular, ES 5.0 comes with a dedicated ingest plugin based on Apache Tika:
https://www.elastic.co/guide/en/elasticsearch/plugins/master/ingest.html
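
As a rough illustration of how the ingest approach could keep the files out of ES, here is a minimal sketch that only builds the request bodies. It assumes the ingest attachment plugin's `attachment` processor plus the built-in `remove` processor; the field names `data` and `hdfs_path` are illustrative choices, not something prescribed by the plugin.

```python
import base64
import json


def build_attachment_pipeline():
    """Pipeline body: extract text with the attachment processor, then drop the raw bytes."""
    return {
        "description": "Extract text from binary documents, keep only the extracted fields",
        "processors": [
            # Parses the base64 content in "data"; results land under "attachment" by default.
            {"attachment": {"field": "data"}},
            # Remove the base64 payload so the file itself is not stored in ES.
            {"remove": {"field": "data"}},
        ],
    }


def build_document(raw_bytes, hdfs_path):
    """Document body to send through the pipeline ("hdfs_path" is a hypothetical field)."""
    return {
        "data": base64.b64encode(raw_bytes).decode("ascii"),
        "hdfs_path": hdfs_path,
    }


if __name__ == "__main__":
    print(json.dumps(build_attachment_pipeline(), indent=2))
    print(json.dumps(build_document(b"hello", "/archive/report.pdf")))
```

The pipeline body would be registered with a PUT to `_ingest/pipeline/<name>`, and each document indexed with `?pipeline=<name>`. Note that the bytes still have to be read from HDFS and sent to ES by some client; the plugin does not read from an HDFS location by itself.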


(Sachin Shaju) #3

So what is the difference between the ingest plugin and the mapper attachments plugin? I was considering the latter as a solution. Can we point to an HDFS location using either of these? I don't want to store the files in ES. The files should remain on HDFS; only the searchable contents should be moved to ES.


(Sachin Shaju) #4

I haven't seen any proper documentation for the ingest plugin, and the alpha version of ES 5.0 has only just been released. So at this time, what would you suggest for my use case? I have an HDFS archive with many types of documents and want to extract all the information into ES while keeping the docs in Hadoop itself. Please take a look at http://stackoverflow.com/questions/36419608/extracting-data-from-documents-stored-in-hdfs-to-index-in-elasticsearch/36436556#36436556 for detailed info.


#5

You might want to look into ManifoldCF.
https://manifoldcf.apache.org/en_US/index.html#What+Is+Apache+ManifoldCF%3F

Just point it to your HDFS repository and define ES as output - it uses Apache Tika as well.


(Sachin Shaju) #6

Thanks for the reply! Can we use Tika as it is to extract from HDFS? Which one would be the better choice?


#7

I may have caused some confusion, but Apache Tika is used by ManifoldCF.
You can decide for yourself whether or not you want to use Tika through the Mapper-Attachments plugin (https://github.com/elastic/elasticsearch-mapper-attachments).
In fact, it is quite easy to set up and does not need advanced configuration.
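
On the question of using Tika directly: one possible shape for that, sketched below, is to read each file over the WebHDFS REST API, send the bytes to a separately running Tika server, and bulk-index only the extracted text. This is not what ManifoldCF or the plugin does internally; the hosts, ports, index name and field names are all placeholders.

```python
import json
import urllib.parse
import urllib.request

# All hosts, ports and names below are placeholders, not values from this thread.
WEBHDFS = "http://namenode.example:50070"
TIKA = "http://localhost:9998"


def webhdfs_open_url(path, base=WEBHDFS):
    """URL for reading one file through the WebHDFS REST API (op=OPEN)."""
    return base + "/webhdfs/v1" + urllib.parse.quote(path) + "?op=OPEN"


def read_hdfs_file(path):
    """Fetch the raw bytes of an HDFS file (urllib follows the datanode redirect)."""
    with urllib.request.urlopen(webhdfs_open_url(path)) as resp:
        return resp.read()


def extract_text(raw_bytes, tika_base=TIKA):
    """Send the bytes to a running Tika server and return the plain text."""
    req = urllib.request.Request(
        tika_base + "/tika",
        data=raw_bytes,
        method="PUT",
        headers={"Accept": "text/plain"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8", errors="replace")


def bulk_body(index, docs):
    """Build an Elasticsearch _bulk payload (newline-delimited JSON) from (path, text) pairs."""
    lines = []
    for path, text in docs:
        # The HDFS path doubles as the document id, so re-indexing overwrites the old entry.
        lines.append(json.dumps({"index": {"_index": index, "_id": path}}))
        lines.append(json.dumps({"hdfs_path": path, "content": text}))
    return "\n".join(lines) + "\n"
```

A caller would do something like `bulk_body("docs", [(p, extract_text(read_hdfs_file(p))) for p in paths])` and POST the result to `_bulk`. Depending on the ES version, the action line may also need a `_type`. The file itself never reaches ES; only the path and extracted text do.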

