How to Enable OCR in Elasticsearch for Enhanced PDF Readability?


We're facing a challenge with a client regarding the upload and processing of PDF files to make them text-searchable. Initially, we implemented an ingest pipeline with Elasticsearch's attachment processor, leveraging Apache Tika.
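
For context, a pipeline built on the attachment processor usually has the shape sketched below. This is only an illustration: the field name `data` and the helper function names are assumptions, not the configuration actually used here.

```python
import base64
import json


def build_attachment_pipeline(source_field: str = "data") -> dict:
    """Body for PUT _ingest/pipeline/<id>; the field name is an assumption."""
    return {
        "description": "Extract text from binary attachments",
        "processors": [
            # The attachment processor runs Tika on the base64 payload.
            {"attachment": {"field": source_field}},
            # Drop the raw payload once the text has been extracted.
            {"remove": {"field": source_field}},
        ],
    }


def build_document(pdf_bytes: bytes, source_field: str = "data") -> dict:
    """Body for indexing a PDF through the pipeline (base64-encoded)."""
    return {source_field: base64.b64encode(pdf_bytes).decode("ascii")}


if __name__ == "__main__":
    print(json.dumps(build_attachment_pipeline(), indent=2))
```

The extracted text ends up under `attachment.content`. For a PDF without a text layer that field comes back empty or nearly empty, which is the symptom described next.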

However, we discovered that Apache Tika in this setup doesn't run OCR (Optical Character Recognition) via Tesseract, which is crucial for us because we deal with a significant number of vectorized PDFs, that is, PDFs without a selectable text layer.

I'm seeking guidance on two critical points:

  1. Is there a way to integrate OCR capabilities, specifically Tesseract, within the Elasticsearch cloud environment to handle the OCR processing of PDFs?
  2. Would setting up a separate server with Apache Tika, equipped with OCR functionality, and then ingesting the processed PDFs into Elasticsearch be a more viable solution? I'm cautious about this approach as it seems like a regression in our process.

Your insights into these questions would be invaluable as we strive to find the most efficient and effective solution for our document processing needs.

Not really. Even the Workplace Search docs say:

If you cannot select the text, this means the PDF is actually an image. You will have to use a 3rd party OCR (optical character recognition) engine to scan the image for text and ingest via a custom source. This process can be hit and miss, depending on the quality of the image and the font used.

That's what the FSCrawler project does.
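
The general idea of running OCR outside Elasticsearch and indexing only the extracted text can be sketched as follows. This is not FSCrawler's implementation, just a minimal illustration under several assumptions: a Tika Server with Tesseract installed is reachable at `localhost:9998`, the `X-Tika-PDFOcrStrategy` and `X-Tika-OCRLanguage` headers are honoured by that Tika version, and the index name `documents`, the field names, and the function names are hypothetical. Authentication, which an Elastic Cloud deployment would require, is omitted.

```python
import json
import urllib.request

TIKA_URL = "http://localhost:9998/tika"  # assumed Tika Server address
ES_URL = "http://localhost:9200"         # assumed Elasticsearch address


def tika_ocr_request(pdf_bytes, tika_url=TIKA_URL, language="eng"):
    """Build (but do not send) a request asking Tika to OCR a PDF."""
    headers = {
        "Accept": "text/plain",
        "Content-Type": "application/pdf",
        # Assumed OCR headers; check them against your Tika version.
        "X-Tika-PDFOcrStrategy": "ocr_only",
        "X-Tika-OCRLanguage": language,
    }
    return urllib.request.Request(
        tika_url, data=pdf_bytes, headers=headers, method="PUT"
    )


def index_request(text, filename, index="documents", es_url=ES_URL):
    """Build (but do not send) a request that indexes the extracted text."""
    body = json.dumps({"filename": filename, "content": text}).encode("utf-8")
    return urllib.request.Request(
        f"{es_url}/{index}/_doc",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ocr_and_index(path):
    """Send one PDF through Tika OCR, then index the resulting text."""
    with open(path, "rb") as f:
        pdf_bytes = f.read()
    with urllib.request.urlopen(tika_ocr_request(pdf_bytes)) as resp:
        text = resp.read().decode("utf-8")
    with urllib.request.urlopen(index_request(text, path)) as resp:
        return json.loads(resp.read())
```

Calling `ocr_and_index("scan.pdf")` would then perform both steps for one file. Since the text arrives already extracted, no attachment pipeline is needed on the Elasticsearch side.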
