Ingest-Attachment: Enabling OCR

Hi there,

I'm trying to index a variety of documents using the Ingest-Attachment plugin. Some of them are image-based PDF documents that require OCR processing.

According to this page on Stack Overflow, Ingest-Attachment (or rather the Tika implementation it contains) can be configured to execute Tesseract by pointing it to the directory where Tesseract is installed. In my case, I would have to add tesseractPath=C:\Program Files (x86)\Tesseract-OCR to the Tika properties file.

However, I'm having difficulty finding the properties file to update with the Tesseract path. There is a properties file in the elasticsearch/plugins/ingest-attachment folder, but simply adding the line to that file does not do the trick. With the line included, Elasticsearch fails to start and shows the following error message:

java.lang.IllegalArgumentException: Unknown properties in plugin descriptor: [tesseractPath]

The Tesseract installation works fine standalone (I have tested it). Does anyone know what I'm doing wrong and how to fix it?

Cheers!

I think this has your answer, and it is consistent with the error you are seeing.
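
To illustrate why the startup fails: the wording of the error suggests the properties file in the plugin folder is the plugin descriptor, and that it is checked against a fixed set of known keys, so any extra line gets rejected. Below is a minimal Python sketch of that kind of strict check. It is not Elasticsearch's actual code; the ALLOWED_KEYS set, the function names, and the 0.0.0 version are assumptions made up for the example.

```python
# Illustrative only: mimics a strict descriptor check, not Elasticsearch's real code.
# ALLOWED_KEYS is a hypothetical set chosen for this example.
ALLOWED_KEYS = {"name", "description", "version", "classname"}


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def validate_descriptor(props):
    """Raise ValueError if the descriptor has any key outside ALLOWED_KEYS."""
    unknown = sorted(set(props) - ALLOWED_KEYS)
    if unknown:
        raise ValueError(
            "Unknown properties in plugin descriptor: [" + ", ".join(unknown) + "]"
        )


# A descriptor with the extra Tesseract line added (version is a placeholder).
descriptor = r"""
name=ingest-attachment
version=0.0.0
tesseractPath=C:\Program Files (x86)\Tesseract-OCR
"""

try:
    validate_descriptor(parse_properties(descriptor))
except ValueError as err:
    print(err)  # Unknown properties in plugin descriptor: [tesseractPath]
```

The point of the sketch is only that a descriptor validated this way has no room for Tika or Tesseract settings, so that file is not the place for the tesseractPath line.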
