So I tried your document.
To make OCR work, I had to install the Tesseract language pack. Did you install it as well?
Once I did, I was able to get text content. I just pushed a PR with a test that shows it in action.
What is the command on Windows?
I am still getting the "PDF is not supported" error:
E:\Tesseract-OCR>tesseract 15857372.pdf out -l heb
Tesseract Open Source OCR Engine v5.0.0-alpha.20200328 with Leptonica
Error in pixReadStream: Pdf reading is not supported
Error in pixRead: pix not read
Error during processing.
As I said, Tesseract does not parse PDF.
Tika parses PDF and for each embedded image, sends it to Tesseract for OCR.
What is the command on Windows?
I believe you need to read the Tesseract project documentation. Tika also provides some advice. See TikaOCR - TIKA - Apache Software Foundation
According to the Tesseract documentation, there is no need to install the language on Windows. You only need to download the appropriate training data, unpack it, and copy the .traineddata file into the 'tessdata' directory,
which I did, and it is still not working.
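For anyone checking the same thing, here is a minimal Python sketch that reports which language files are missing from a tessdata directory. The function name `missing_traineddata` is made up for illustration, and the directory you pass in is an assumption about where Tesseract was installed. It only checks that the files exist, not that Tesseract can load them.

```python
from pathlib import Path


def missing_traineddata(tessdata_dir, langs):
    """Return the language codes with no <lang>.traineddata file in tessdata_dir.

    Hypothetical helper: it only checks that the files exist on disk.
    """
    tessdata = Path(tessdata_dir)
    return [
        lang
        for lang in langs
        if not (tessdata / f"{lang}.traineddata").is_file()
    ]
```

For the install shown in this thread, the call would probably look like `missing_traineddata(r"E:\Tesseract-OCR\tessdata", ["eng", "heb"])`, assuming tessdata sits directly under the install folder. An empty list means both files are in place. Running `tesseract --list-langs` is another way to see which languages Tesseract itself detects.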
Then I don't know. Best guess is to check your tesseract installation.
I'd generate an image from the pdf file you shared with me and try to manually send it to Tesseract to see if OCR is working well with Hebrew text.
You told me that it works well with English content.
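The manual check suggested above could be sketched roughly like this in Python. It assumes Poppler's `pdftoppm` and the `tesseract` binary are both on the PATH. Poppler is only one way to render a PDF page to an image and nobody in this thread confirmed using it. The helper names are hypothetical.

```python
import subprocess
from pathlib import Path


def render_command(pdf_path, image_prefix, dpi=300):
    """Build a pdftoppm call that writes one PNG per page as <image_prefix>-<page>.png."""
    return ["pdftoppm", "-png", "-r", str(dpi), str(pdf_path), str(image_prefix)]


def ocr_command(image_path, output_base, lang="heb"):
    """Build a tesseract call; the recognized text goes to <output_base>.txt."""
    if Path(image_path).suffix.lower() == ".pdf":
        # Tesseract reads images only, which is the error shown earlier in the thread.
        raise ValueError("Tesseract reads images, not PDF files; render the page first")
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def ocr_first_page(pdf_path, workdir="."):
    """Render the PDF, OCR the first page image, and return the recognized text."""
    work = Path(workdir)
    subprocess.run(render_command(pdf_path, work / "page"), check=True)
    # pdftoppm zero-pads page numbers, so a plain sort keeps pages in order.
    first_image = sorted(work.glob("page-*.png"))[0]
    subprocess.run(ocr_command(first_image, work / "out"), check=True)
    return (work / "out.txt").read_text(encoding="utf-8")
```

If `ocr_first_page("15857372.pdf")` returns readable Hebrew, the Tesseract installation and the heb language data are fine and the problem is elsewhere. If it fails or returns garbage, the Tesseract setup is the place to look.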