Tif files in fscrawler

dadoonet · May 25, 2020, 9:36am

So I tried your document.
To make OCR work, I had to install the Tesseract language pack. Did you install it as well?

Once I did, I was able to get text content. I just pushed a PR as a test that shows it in action.

avishai.d · May 25, 2020, 10:38am

What is the command on Windows?
I'm still getting the "Pdf reading is not supported" error:

E:\Tesseract-OCR>tesseract 15857372.pdf out -l heb
Tesseract Open Source OCR Engine v5.0.0-alpha.20200328 with Leptonica
Error in pixReadStream: Pdf reading is not supported
Error in pixRead: pix not read
Error during processing.

dadoonet · May 25, 2020, 11:27am

As I said, Tesseract does not parse PDF.
Tika parses PDF and for each embedded image, sends it to Tesseract for OCR.
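
To illustrate the point, here is a minimal Python sketch of a guard around the direct command-line call. The helper name `tesseract_command` is hypothetical; the command shape (`tesseract <image> <outputbase> -l <lang>`) is the one from the transcript above. It only builds the argument list and refuses PDF input, since the tesseract CLI reports "Pdf reading is not supported" for such files.

```python
from pathlib import Path

# Input types the tesseract CLI rejects when called directly.
# Only PDF is listed because that is the case shown in the error above.
UNSUPPORTED_DIRECT_INPUT = {".pdf"}


def tesseract_command(image_path, output_base, lang):
    """Build the argument list for `tesseract <image> <outputbase> -l <lang>`.

    Hypothetical helper: raises ValueError for PDF input, which has to be
    turned into images (for example by Tika) before OCR.
    """
    suffix = Path(image_path).suffix.lower()
    if suffix in UNSUPPORTED_DIRECT_INPUT:
        raise ValueError(f"{image_path}: convert the PDF pages to images first")
    return ["tesseract", str(image_path), str(output_base), "-l", lang]
```

Calling `tesseract_command("page.tif", "out", "heb")` gives a list that can be passed to `subprocess.run`; calling it with a `.pdf` path raises before Tesseract is ever started.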

> What is the command in windows ?

I believe you need to read the Tesseract project documentation. Tika also provides some advice. See TikaOCR - TIKA - Apache Software Foundation

avishai.d · May 25, 2020, 12:48pm

According to the Tesseract documentation, there is no need to install the language on Windows. You only have to download the appropriate training data, unpack it, and copy the .traineddata file into the 'tessdata' directory,
which I did, and it's still not working.
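
A quick way to double-check that step is to confirm the file really is where Tesseract looks for it. Below is a minimal Python sketch; the helper name `has_language` is hypothetical, and the directory `E:\Tesseract-OCR\tessdata` is only an assumption based on the prompt shown earlier in the thread.

```python
from pathlib import Path


def has_language(tessdata_dir, lang):
    """Return True if <lang>.traineddata exists in the given tessdata directory."""
    return (Path(tessdata_dir) / f"{lang}.traineddata").is_file()


if __name__ == "__main__":
    # Assumed install location; adjust to the actual tessdata directory.
    print(has_language(r"E:\Tesseract-OCR\tessdata", "heb"))
```

Running `tesseract --list-langs` from the command line is another way to see which languages the installation actually picks up. If the file exists but the language is not listed, the `TESSDATA_PREFIX` environment variable may be pointing at a different directory.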

dadoonet · May 25, 2020, 1:11pm

Then I don't know. My best guess is that you should check your Tesseract installation.
I'd generate an image from the pdf file you shared with me and try to manually send it to Tesseract to see if OCR is working well with Hebrew text.
You told me that it works well with English content.
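
That manual check could be scripted roughly as follows. This is a sketch under assumptions: it relies on Poppler's `pdftoppm` being installed to render the pages (any other PDF-to-image tool would do), and the function names are hypothetical. Neither tool is called unless both are found on the PATH.

```python
import shutil
import subprocess
from pathlib import Path


def pdftoppm_command(pdf_path, image_prefix, dpi=300):
    """Render each PDF page to <image_prefix>-<n>.png (assumes Poppler's pdftoppm)."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(image_prefix)]


def tesseract_command(image_path, output_base, lang):
    """Same command shape as in the transcript: tesseract <image> <outputbase> -l <lang>."""
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def ocr_pdf_manually(pdf_path, work_dir, lang="heb"):
    """Convert a PDF to page images, OCR each one, and return the text per page."""
    for tool in ("pdftoppm", "tesseract"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found on PATH")
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(pdftoppm_command(pdf_path, work_dir / "page"), check=True)
    texts = []
    for image in sorted(work_dir.glob("page-*.png")):
        out_base = image.with_suffix("")  # tesseract appends .txt itself
        subprocess.run(tesseract_command(image, out_base, lang), check=True)
        texts.append(out_base.with_suffix(".txt").read_text(encoding="utf-8"))
    return texts
```

If the returned text is empty or garbled for Hebrew but fine with `lang="eng"` on an English page, the problem is more likely in the Tesseract language data than in FSCrawler or Tika.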

system · June 22, 2020, 1:11pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
