How to Enable OCR in Elasticsearch for Enhanced PDF Readability?

Chenko · January 17, 2024, 9:56am

Hello,

We're facing a challenge with a client regarding the upload and processing of PDF files to make them text-searchable. Initially, we implemented an ingest pipeline with Elasticsearch's attachment processor, leveraging Apache Tika.

However, we discovered that Apache Tika in this setup doesn't utilize OCR (Optical Character Recognition) via Tesseract, which is crucial for our needs as we deal with a significant number of vectorized PDFs.

I'm seeking guidance on two critical points:

1. Is there a way to integrate OCR capabilities, specifically Tesseract, within the Elasticsearch cloud environment to handle the OCR processing of PDFs?
2. Would setting up a separate server with Apache Tika, equipped with OCR functionality, and then ingesting the processed PDFs into Elasticsearch be a more viable solution? I'm cautious about this approach as it seems like a regression in our process.

Your insights into these questions would be invaluable as we strive to find the most efficient and effective solution for our document processing needs.

dadoonet · January 17, 2024, 9:59am

Not really. Even for Workplace Search, the docs say:

If you cannot select the text, this means the PDF is actually an image. You will have to use a 3rd party OCR (optical character recognition) engine to scan the image for text and ingest via a custom source. This process can be hit and miss, depending on the quality of the image and the font used.

That's what the FSCrawler project does.
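
If you would rather wire this up yourself than run FSCrawler, the "OCR outside, then ingest" flow can be sketched roughly as below. This is only a minimal sketch under several assumptions: a standalone Tika Server with Tesseract installed is reachable at `http://localhost:9998`, it honors the `X-Tika-PDFOcrStrategy` request header, and the Elasticsearch URL, the index name `documents`, and the `filename`/`content` fields are hypothetical placeholders to adapt to your cluster.

```python
import json
import urllib.parse
import urllib.request

TIKA_URL = "http://localhost:9998/tika"  # assumption: local Tika Server with Tesseract
ES_URL = "https://localhost:9200"        # assumption: your Elasticsearch endpoint
INDEX = "documents"                      # hypothetical index name


def build_tika_request(pdf_bytes, tika_url=TIKA_URL):
    """Build a PUT request asking Tika Server to return the PDF's text as plain text."""
    headers = {
        "Content-Type": "application/pdf",
        "Accept": "text/plain",
        # Assumption: this header switches on OCR for PDF pages in Tika Server.
        "X-Tika-PDFOcrStrategy": "ocr_and_text_extraction",
    }
    return urllib.request.Request(tika_url, data=pdf_bytes, headers=headers, method="PUT")


def build_index_request(doc_id, filename, text, es_url=ES_URL, index=INDEX, api_key=None):
    """Build a PUT request that indexes the extracted text as a single document."""
    body = json.dumps({"filename": filename, "content": text}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"ApiKey {api_key}"
    # Escape the id so spaces or slashes cannot change the URL path.
    url = f"{es_url}/{index}/_doc/{urllib.parse.quote(doc_id, safe='')}"
    return urllib.request.Request(url, data=body, headers=headers, method="PUT")


def ocr_and_index(path, doc_id, api_key=None):
    """Send one PDF through Tika (with OCR), then index the resulting text."""
    with open(path, "rb") as f:
        pdf_bytes = f.read()
    with urllib.request.urlopen(build_tika_request(pdf_bytes)) as resp:
        text = resp.read().decode("utf-8", errors="replace")
    with urllib.request.urlopen(build_index_request(doc_id, path, text, api_key=api_key)) as resp:
        return json.loads(resp.read())
```

Calling `ocr_and_index("scan.pdf", "scan-1", api_key="...")` would run one file through both steps. Since the text is already extracted, no attachment processor is needed on the Elasticsearch side in this variant.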

