We're working with a client who needs uploaded PDF files to be text-searchable. Our first implementation was an Elasticsearch ingest pipeline using the attachment processor, which extracts text with Apache Tika.
However, we discovered that Apache Tika in this setup doesn't run OCR (Optical Character Recognition) via Tesseract. That is a blocker for us, because a significant number of our PDFs are vectorized and need OCR for their text to be extracted.
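For context, a pipeline of this kind can be sketched as below. This is a minimal illustration rather than our exact configuration: the field name `data`, the target field `attachment`, and both helper function names are placeholders. It only builds the request bodies; sending them is left to whichever client you use.

```python
import base64


def build_attachment_pipeline(source_field: str = "data") -> dict:
    """Build the body for PUT _ingest/pipeline/<name> (field names are placeholders)."""
    return {
        "description": "Extract text from base64-encoded PDFs",
        "processors": [
            {
                "attachment": {
                    "field": source_field,         # field holding the base64 file
                    "target_field": "attachment",  # where extracted text and metadata go
                    "indexed_chars": -1,           # -1 lifts the extracted-character limit
                }
            }
        ],
    }


def build_document(pdf_bytes: bytes, source_field: str = "data") -> dict:
    """Build the document body; the attachment processor expects base64 input."""
    return {source_field: base64.b64encode(pdf_bytes).decode("ascii")}
```

The document would then be indexed with the `pipeline` parameter set to the pipeline's name. For PDFs without a text layer, the extracted content comes back empty, which is the behavior described above.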
I'm seeking guidance on two critical points:
- Is there a way to enable OCR, specifically Tesseract, within the Elasticsearch cloud environment so that PDFs are OCR-processed during ingestion?
- Would it be more viable to set up a separate server running Apache Tika with OCR enabled, and then ingest the processed PDFs into Elasticsearch? I'm cautious about this approach because it seems like a step backward in our process.
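To make the second option concrete, here is a rough sketch of what I have in mind, under several assumptions: a Tika Server instance reachable at `http://localhost:9998` with Tesseract installed on the same host, and an index whose documents have `filename` and `content` fields. The URL, the field names, and the function names are all hypothetical, and I have not verified this against our deployment.

```python
import urllib.request

TIKA_URL = "http://localhost:9998/tika"  # assumed address of a self-hosted Tika Server


def build_tika_request(pdf_bytes: bytes, tika_url: str = TIKA_URL) -> urllib.request.Request:
    """Prepare a PUT request asking Tika Server for plain text, with OCR on PDFs."""
    return urllib.request.Request(
        tika_url,
        data=pdf_bytes,
        method="PUT",
        headers={
            "Content-Type": "application/pdf",
            "Accept": "text/plain",
            # Ask the PDF parser to OCR pages in addition to normal text extraction.
            "X-Tika-PDFOcrStrategy": "ocr_and_text_extraction",
        },
    )


def extract_text(pdf_bytes: bytes, tika_url: str = TIKA_URL) -> str:
    """Send the PDF to Tika Server and return the extracted text."""
    with urllib.request.urlopen(build_tika_request(pdf_bytes, tika_url)) as resp:
        return resp.read().decode("utf-8")


def build_index_document(filename: str, text: str) -> dict:
    """Build the document to index in Elasticsearch (field names are placeholders)."""
    return {"filename": filename, "content": text}
```

With this approach the attachment processor would no longer be needed, since the text arrives in Elasticsearch already extracted. The cost is an extra service to run and scale.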
Your insights into these questions would be invaluable as we strive to find the most efficient and effective solution for our document processing needs.