FSCrawler - OCR not working anymore in 2.9 without Tesseract location in PATH

HaroldH · June 1, 2022, 12:47pm

Hello,

After upgrading FSCrawler from 2.7 to 2.9, I noticed that OCR no longer works with our configuration. In our _settings.yaml file we set the path to Tesseract as shown below:

  ocr:
    language: "eng+nld"
    path: "D:\\opt\\Tesseract-OCR"
    data_path: "D:\\opt\\Tesseract-OCR\\tessdata"
    enabled: true
    pdf_strategy: "auto"

The logging from FSCrawler 2.7 shows OCR working correctly:

14:13:42,549 TRACE [f.p.e.c.f.t.TikaDocParser] Beginning Tika extraction
14:13:42,567 DEBUG [f.p.e.c.f.t.TikaInstance] OCR is activated.
14:13:42,572 DEBUG [f.p.e.c.f.t.TikaInstance] But Tesseract is not installed so we won't run OCR.
14:13:42,920 WARN  [o.a.t.p.PDFParser] J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

14:13:43,230 DEBUG [f.p.e.c.f.t.TikaInstance] OCR is activated so we need to configure Tesseract in case we have specific settings.
14:13:43,231 DEBUG [f.p.e.c.f.t.TikaInstance] Tesseract Path set to [D:\opt\Tesseract-OCR].
14:13:43,232 DEBUG [f.p.e.c.f.t.TikaInstance] Tesseract Data Path set to [D:\opt\Tesseract-OCR\tessdata].
14:13:43,232 DEBUG [f.p.e.c.f.t.TikaInstance] Tesseract Language set to [eng+nld].
14:13:43,396 DEBUG [o.a.t.p.o.TesseractOCRParser] Tesseract command: D:\opt\Tesseract-OCR\tesseract.exe C:\Windows\TEMP\apache-tika-752489536816618607.tmp C:\Windows\TEMP\apache-tika-2999124896453799227.tmp -l eng+nld --psm 1 -c page_separator= -c preserve_interword_spaces=0 txt
14:13:46,839 DEBUG [o.a.t.p.o.TesseractOCRParser] 
14:13:46,845 DEBUG [o.a.t.p.o.TesseractOCRParser] Tesseract Open Source OCR Engine v5.0.0-alpha.20201127 with Leptonica
Page 1

14:13:46,858 TRACE [f.p.e.c.f.t.TikaDocParser] End of Tika extraction
14:13:47,265 TRACE [f.p.e.c.f.c.v.ElasticsearchClientV7] Sending a bulk request of [1] requests
14:13:47,376 TRACE [f.p.e.c.f.c.v.ElasticsearchClientV7] Executed bulk request with [1] requests
14:13:48,407 TRACE [f.p.e.c.f.t.TikaDocParser] Main detected language: [nl: HIGH (0.999996)]
14:13:48,415 TRACE [f.p.e.c.f.t.TikaDocParser] End document generation
14:13:48,415 TRACE [f.p.e.c.f.f.FsCrawlerUtil] No pattern always matches.
14:13:48,415 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(D:\DMS\DOCD2, D:\DMS\DOCD2\harold3.tif) = \harold3.tif
14:13:48,439 DEBUG [f.p.e.c.f.s.FsCrawlerDocumentServiceElasticsearchImpl] Indexing psf94038_data_v1/5438744b41121ee2d59cdd51b9c141f8?pipeline=null
14:13:48,439 TRACE [f.p.e.c.f.c.v.ElasticsearchClientV7] JSon indexed : {
  "content" : "...........",
  "meta" : {
    "language" : "nl"
  }

With the same configuration, the logging from 2.9 shows that OCR no longer works:

14:20:46,006 TRACE [f.p.e.c.f.t.TikaDocParser] Beginning Tika extraction
14:20:46,020 DEBUG [f.p.e.c.f.t.TikaInstance] OCR is activated.
14:20:46,027 DEBUG [f.p.e.c.f.t.TikaInstance] Tesseract Path set to [D:\opt\Tesseract-OCR].
14:20:46,027 DEBUG [f.p.e.c.f.t.TikaInstance] Tesseract Data Path set to [D:\opt\Tesseract-OCR\tessdata].
14:20:46,074 DEBUG [o.a.t.p.o.TesseractOCRParser] hasTesseract (path: [D:\opt\Tesseract-OCR\tesseract.exe]): true
14:20:46,075 DEBUG [f.p.e.c.f.t.TikaInstance] OCR strategy for PDF documents is [auto] and tesseract was found.
14:20:46,075 INFO  [f.p.e.c.f.t.TikaInstance] OCR is enabled. This might slowdown the process.
14:20:46,423 DEBUG [o.a.t.p.o.TesseractOCRParser] hasTesseract (path: [tesseract.exe]): false
14:20:46,424 DEBUG [o.a.t.p.o.TesseractOCRParser] ImageMagick does not appear to be installed (commandline: magick)
14:20:46,571 DEBUG [o.a.t.p.o.TesseractOCRParser] hasTesseract (path: [tesseract.exe]): false
14:20:46,572 DEBUG [o.a.t.p.o.TesseractOCRParser] ImageMagick does not appear to be installed (commandline: magick)
14:20:47,516 TRACE [f.p.e.c.f.t.TikaDocParser] End of Tika extraction
14:20:48,937 TRACE [f.p.e.c.f.t.TikaDocParser] Main detected language: [: NONE (0.000000)]
14:20:48,945 TRACE [f.p.e.c.f.t.TikaDocParser] End document generation
14:20:48,945 TRACE [f.p.e.c.f.f.FsCrawlerUtil] Null or empty content always matches.
14:20:48,945 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(D:\DMS\DOCD2, D:\DMS\DOCD2\harold3.tif) = \harold3.tif
14:20:48,954 DEBUG [f.p.e.c.f.s.FsCrawlerDocumentServiceElasticsearchImpl] Indexing psf94038_data_v1/5438744b41121ee2d59cdd51b9c141f8?pipeline=null
14:20:48,955 TRACE [f.p.e.c.f.c.v.ElasticsearchClientV7] JSon indexed : {
  "meta" : { },
}
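
One way to spot the difference between the two runs is to pull the `hasTesseract` lines out of the DEBUG log and see which path each check used. The sketch below is only an illustration: the regular expression is written against the log lines quoted above, and the function name `tesseract_checks` is made up for this example; it is not part of FSCrawler or Tika.

```python
import re

# Matches the Tika debug line quoted above, for example:
#   ... hasTesseract (path: [tesseract.exe]): false
# The pattern is an assumption based only on the log lines in this post.
HAS_TESSERACT = re.compile(
    r"hasTesseract \(path: \[(?P<path>[^\]]*)\]\): (?P<found>true|false)"
)


def tesseract_checks(log_lines):
    """Return a (path, found) pair for every hasTesseract line in the log."""
    checks = []
    for line in log_lines:
        match = HAS_TESSERACT.search(line)
        if match:
            checks.append((match.group("path"), match.group("found") == "true"))
    return checks


# Two lines copied from the 2.9 log above.
sample = [
    r"14:20:46,074 DEBUG [o.a.t.p.o.TesseractOCRParser] hasTesseract (path: [D:\opt\Tesseract-OCR\tesseract.exe]): true",
    r"14:20:46,423 DEBUG [o.a.t.p.o.TesseractOCRParser] hasTesseract (path: [tesseract.exe]): false",
]

for path, found in tesseract_checks(sample):
    print(path, found)
```

In the 2.9 log, the first check uses the configured directory and succeeds, while the later checks use the bare `tesseract.exe` name and fail.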

After adding the Tesseract location to the system PATH variable, OCR works correctly again.
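
To check whether the bare `tesseract` command would resolve on a machine, with or without the install directory added to the search path, something like the following could be used. It is a rough sketch: `resolve_tesseract` is a hypothetical helper, it relies on Python's standard `shutil.which`, and it only mimics a PATH lookup; it does not reproduce how Tika locates the binary.

```python
import os
import shutil


def resolve_tesseract(extra_dir=None, exe_name="tesseract"):
    """Return the full path the executable resolves to, or None if not found.

    extra_dir is an optional directory that is searched before the
    directories already listed in the PATH environment variable.
    """
    search_path = os.environ.get("PATH", "")
    if extra_dir:
        search_path = extra_dir + os.pathsep + search_path
    return shutil.which(exe_name, path=search_path)


if __name__ == "__main__":
    # Lookup using the current PATH only.
    print("PATH only:", resolve_tesseract())
    # Lookup with the directory from _settings.yaml searched first
    # (the directory is the one quoted in this post).
    print("With install dir:", resolve_tesseract(r"D:\opt\Tesseract-OCR"))
```

If the first lookup prints `None` and the second prints a path, the executable exists but is not reachable through PATH, which matches the situation described here.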

Regards,
Harold

dadoonet · June 1, 2022, 9:46pm

Could you try with 2.10-SNAPSHOT?

