Ingest PDF/DOC/PPT files from HDFS into Elasticsearch: FsCrawler vs. ES-Hadoop

Guys, hope you can shed some light on this use case. I have a use case to store thousands of documents and index them. For this, a Spring Boot app will receive the uploaded files, store the metadata/application-related data in MongoDB, store the actual file in HDFS, and store the extracted content in Elasticsearch for searching. When searching for any keyword, the parsed content will be shown, along with the ability to view the actual file (using a Spring Boot/Angular 4 app).

To extract the contents, I was considering the approaches below:

  1. Ingest attachment plugin - I did not use this because the files have to be encoded in Base64 before being sent to the plugin. Encoding large files is a tedious task and takes a lot of memory.

  2. FsCrawler - Run FsCrawler and submit the documents to its REST endpoint, which stores the extracted contents in Elasticsearch. I did a POC and it works as expected. FsCrawler uses Apache Tika and supports most file formats (see the sketch after this list).

  3. I decided to go with FsCrawler but stumbled upon the es-hadoop connector project. Will ES-Hadoop satisfy this purpose? Has anyone tried it?
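
For context, here is a rough sketch of how my POC submits a file to FsCrawler from the Spring Boot side. The endpoint URL and the multipart field name (`file`) are assumptions based on FsCrawler's REST service defaults, so double-check them against your FsCrawler version's docs:

```java
import java.io.File;

import org.springframework.core.io.FileSystemResource;
import org.springframework.http.HttpEntity;
import org.springframework.http.HttpHeaders;
import org.springframework.http.MediaType;
import org.springframework.util.LinkedMultiValueMap;
import org.springframework.util.MultiValueMap;
import org.springframework.web.client.RestTemplate;

public class FsCrawlerUploader {

    // Assumed default FsCrawler REST endpoint; adjust host/port for your setup.
    private static final String FSCRAWLER_UPLOAD_URL = "http://127.0.0.1:8080/fscrawler/_upload";

    private final RestTemplate restTemplate = new RestTemplate();

    /** Sends one file to FsCrawler, which extracts the content with Tika and indexes it into Elasticsearch. */
    public String upload(File file) {
        HttpHeaders headers = new HttpHeaders();
        headers.setContentType(MediaType.MULTIPART_FORM_DATA);

        MultiValueMap<String, Object> body = new LinkedMultiValueMap<>();
        body.add("file", new FileSystemResource(file)); // multipart field name FsCrawler expects (assumed)

        HttpEntity<MultiValueMap<String, Object>> request = new HttpEntity<>(body, headers);
        return restTemplate.postForObject(FSCRAWLER_UPLOAD_URL, request, String.class);
    }
}
```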

Do you guys have any other suggestions for tools available for this purpose?

ES-Hadoop can certainly help offload your ingestion needs from Elasticsearch, but there are some points to be aware of:

ES-Hadoop doesn't have any built-in PDF processing abilities. If you were working with this kind of data, you would need to be familiar with the appropriate libraries (such as Apache Tika, or FsCrawler's internals).
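
For example, here is a minimal sketch of doing the extraction yourself with the Tika facade before indexing; the file path is just a placeholder (in this use case it would be a file pulled from HDFS):

```java
import java.io.File;

import org.apache.tika.Tika;

public class TikaExtractor {

    public static void main(String[] args) throws Exception {
        Tika tika = new Tika(); // auto-detects and parses PDF, DOC, PPT, and many other formats

        // Placeholder path: in the pipeline above this would be a file fetched from HDFS.
        String text = tika.parseToString(new File("/tmp/sample.pdf"));

        // 'text' is what you would index into Elasticsearch alongside your metadata.
        System.out.println(text);
    }
}
```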

ES-Hadoop merely provides the connector for writing data to and reading data from Elasticsearch in a Hadoop environment. This is great if all of your data is already being processed with Hadoop, but it may be a bit heavyweight if it isn't.
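
To make that concrete, here is a rough sketch of a batch write through ES-Hadoop from Spark's Java API. The cluster address, index name, and field values are placeholders, and the documents are assumed to have been extracted (e.g. with Tika) before this step:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.elasticsearch.spark.rdd.api.java.JavaEsSpark;

public class EsHadoopWriteExample {

    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("es-hadoop-write")
                .setMaster("local[*]")              // local test only; drop when using spark-submit
                .set("es.nodes", "localhost:9200"); // point at your Elasticsearch cluster

        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // In the real pipeline these maps would hold content already extracted with Tika.
            Map<String, String> doc1 = new HashMap<>();
            doc1.put("title", "report.pdf");
            doc1.put("content", "extracted text ...");

            Map<String, String> doc2 = new HashMap<>();
            doc2.put("title", "slides.ppt");
            doc2.put("content", "extracted text ...");

            JavaRDD<Map<String, String>> docs = sc.parallelize(Arrays.asList(doc1, doc2));

            // ES-Hadoop handles the bulk write; "docs/doc" is an illustrative index/type resource.
            JavaEsSpark.saveToEs(docs, "docs/doc");
        }
    }
}
```

The point stands, though: nothing in this path parses a PDF for you; the extraction still has to happen before the saveToEs call.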

Regarding HTTP endpoints: an FsCrawler instance can accept a document over HTTP and send it to Elasticsearch. ES-Hadoop does not have anything like this built in, nor does most of the Hadoop ecosystem. You would need to queue up your real-time data and send it through a stream-processing library that we support (like Spark Streaming or Storm) to get this kind of single-document request functionality.
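
As a sketch of that pattern, the example below builds on the previous one: a Spark Streaming job drains a placeholder socket source (standing in for a real queue such as Kafka) and writes each micro-batch to Elasticsearch through ES-Hadoop. The source, index name, and field names are illustrative:

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.elasticsearch.spark.rdd.api.java.JavaEsSpark;

public class StreamingIndexExample {

    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf()
                .setAppName("streaming-index")
                .setMaster("local[2]")              // the socket receiver needs at least 2 local threads
                .set("es.nodes", "localhost:9200");

        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        // Placeholder source: in practice this would be a queue (e.g. Kafka) carrying
        // newly uploaded documents or references to them.
        JavaDStream<String> lines = jssc.socketTextStream("localhost", 9999);

        lines.foreachRDD((JavaRDD<String> rdd) -> {
            // Each incoming line is treated as already-extracted text for illustration.
            JavaRDD<Map<String, String>> docs = rdd.map(text -> {
                Map<String, String> doc = new HashMap<>();
                doc.put("content", text);
                return doc;
            });
            JavaEsSpark.saveToEs(docs, "docs/doc"); // illustrative index/type resource
        });

        jssc.start();
        jssc.awaitTermination();
    }
}
```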

> Do you guys have any other suggestions for tools available for this purpose?

You could ask around in the Logstash forum about doing document processing. I know that the author of FsCrawler has an Apache Tika-based plugin incubating for Logstash here, and that there are some conversations around better support for Tika for these kinds of use cases.

FsCrawler is also great at what it does. It's not updated at the same cadence as other projects (like Logstash or ES-Hadoop), but it still sees 2 or 3 releases a year.

Hope that helps! Let us know if you have any other questions.
