Extracting data from documents stored in HDFS to index in Elasticsearch

Sachin054 · April 5, 2016, 12:42pm

I have an HDFS archive that stores a variety of documents: PDF, MS Word, PPT, CSV, etc. I would like to build a platform on Elasticsearch to search these files by name or by text content. I know I can use the es-hadoop plugin to index data from HDFS into ES. What is the best way to extract the textual data from the documents stored in HDFS and index it?

Any help would be appreciated.

costin · April 5, 2016, 3:29pm

Take a look at the various plugins for Elasticsearch. Your problem is one of content extraction, and that's where the plugins help a lot.
In particular ES 5.0 comes with a dedicated ingest plugin based on Apache Tika:
https://www.elastic.co/guide/en/elasticsearch/plugins/master/ingest.html
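As a rough illustration of the ingest approach, the sketch below builds the two JSON bodies involved: a pipeline definition using the `attachment` processor (provided by the Tika-based ingest-attachment plugin) and a document carrying the file as base64. It assumes the processor names documented for the 5.x line; the field names `data` and `hdfs_path` and the sample path are hypothetical, and nothing is sent to a cluster here. The trailing `remove` processor is one way to drop the binary after extraction so that only the extracted text is indexed.

```python
import base64
import json


def build_attachment_pipeline(source_field="data"):
    """Return a body for PUT _ingest/pipeline/<id> (the id is up to you)."""
    return {
        "description": "Extract text from base64-encoded files",
        "processors": [
            # Runs Tika on the base64 content found in `source_field`.
            {"attachment": {"field": source_field}},
            # Drops the base64 payload so only the extracted text is kept.
            {"remove": {"field": source_field}},
        ],
    }


def build_document(raw_bytes, hdfs_path, source_field="data"):
    """Return a document body; `hdfs_path` is a hypothetical pointer field."""
    return {
        "hdfs_path": hdfs_path,
        source_field: base64.b64encode(raw_bytes).decode("ascii"),
    }


pipeline = build_attachment_pipeline()
doc = build_document(b"hello from HDFS", "hdfs://namenode/archive/report.pdf")
print(json.dumps(pipeline, indent=2))
print(json.dumps(doc, indent=2))
```

The document would then be indexed with the pipeline's id passed as the `pipeline` request parameter. Reading the bytes out of HDFS is left to whatever client you already use.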

Sachin054 · April 6, 2016, 5:02am

So what is the difference between the ingest plugin and the mapper attachments plugin? I was looking at the latter as a possible solution. Can either of them point to an HDFS location? I don't want to store the files in ES. The files should remain on HDFS; only the searchable content should be moved to ES.

Sachin054 · April 6, 2016, 5:27am

I haven't seen any proper documentation for the ingest plugin, and the alpha version of ES 5.0 has only just been released. So what would you suggest for my use case at this time? I have an HDFS archive with many types of documents and want to extract all the information into ES while keeping the documents in Hadoop itself. Please take a look at http://stackoverflow.com/questions/36419608/extracting-data-from-documents-stored-in-hdfs-to-index-in-elasticsearch/36436556#36436556 for detailed info.

anon55368183 · April 6, 2016, 2:55pm

You might want to look into ManifoldCF.
https://manifoldcf.apache.org/en_US/index.html#What+Is+Apache+ManifoldCF%3F

Just point it to your HDFS repository and define ES as output - it uses Apache Tika as well.
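The general crawl, extract, index pattern that such a connector follows can be sketched as below. This is not ManifoldCF's API, just a minimal illustration: `extract_text` is a hypothetical stand-in for a Tika call, the sample paths and the `path`/`content` field names are made up, and the function only builds an Elasticsearch bulk request body (one action line plus one source line per file) without sending it. Only the extracted text and a pointer back to HDFS go into the index; the files themselves stay where they are.

```python
import json


def build_bulk_body(files, extract_text):
    """Build a newline-delimited bulk body from (path, raw_bytes) pairs."""
    lines = []
    for path, raw in files:
        # Action line: index this document, using the HDFS path as its id.
        lines.append(json.dumps({"index": {"_id": path}}))
        # Source line: extracted text plus a pointer back to the file.
        lines.append(json.dumps({"path": path, "content": extract_text(raw)}))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


def fake_extract(raw):
    """Hypothetical placeholder for real extraction (e.g. Tika)."""
    return raw.decode("utf-8", errors="replace")


sample = [
    ("hdfs://namenode/archive/a.txt", b"first document"),
    ("hdfs://namenode/archive/b.txt", b"second document"),
]
body = build_bulk_body(sample, fake_extract)
print(body)
```

The target index (and, on older versions, the type) would be given in the bulk request URL, since the action lines above carry only `_id`.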

Sachin054 · April 7, 2016, 5:09am

Thanks for the reply! Can we use Tika as it is to extract from HDFS? Which one would be the better choice?

anon55368183 · April 8, 2016, 9:46am

I may have caused some confusion: Apache Tika is used by ManifoldCF.
You can decide for yourself whether or not you want to use Tika through the Mapper-Attachments plugin (https://github.com/elastic/elasticsearch-mapper-attachments).
In fact, it is quite easy to set up and does not need advanced configuration.
