Hello,
After upgrading FSCrawler from 2.7 to 2.9, I noticed that OCR no longer works with our configuration. In our _settings.yaml file we set the path to Tesseract as shown below:
ocr:
  language: "eng+nld"
  path: "D:\\opt\\Tesseract-OCR"
  data_path: "D:\\opt\\Tesseract-OCR\\tessdata"
  enabled: true
  pdf_strategy: "auto"
The logging from FSCrawler 2.7 shows OCR working correctly:
14:13:42,549 TRACE [f.p.e.c.f.t.TikaDocParser] Beginning Tika extraction
14:13:42,567 DEBUG [f.p.e.c.f.t.TikaInstance] OCR is activated.
14:13:42,572 DEBUG [f.p.e.c.f.t.TikaInstance] But Tesseract is not installed so we won't run OCR.
14:13:42,920 WARN [o.a.t.p.PDFParser] J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.
14:13:43,230 DEBUG [f.p.e.c.f.t.TikaInstance] OCR is activated so we need to configure Tesseract in case we have specific settings.
14:13:43,231 DEBUG [f.p.e.c.f.t.TikaInstance] Tesseract Path set to [D:\opt\Tesseract-OCR].
14:13:43,232 DEBUG [f.p.e.c.f.t.TikaInstance] Tesseract Data Path set to [D:\opt\Tesseract-OCR\tessdata].
14:13:43,232 DEBUG [f.p.e.c.f.t.TikaInstance] Tesseract Language set to [eng+nld].
14:13:43,396 DEBUG [o.a.t.p.o.TesseractOCRParser] Tesseract command: D:\opt\Tesseract-OCR\tesseract.exe C:\Windows\TEMP\apache-tika-752489536816618607.tmp C:\Windows\TEMP\apache-tika-2999124896453799227.tmp -l eng+nld --psm 1 -c page_separator= -c preserve_interword_spaces=0 txt
14:13:46,839 DEBUG [o.a.t.p.o.TesseractOCRParser]
14:13:46,845 DEBUG [o.a.t.p.o.TesseractOCRParser] Tesseract Open Source OCR Engine v5.0.0-alpha.20201127 with Leptonica
Page 1
14:13:46,858 TRACE [f.p.e.c.f.t.TikaDocParser] End of Tika extraction
14:13:47,265 TRACE [f.p.e.c.f.c.v.ElasticsearchClientV7] Sending a bulk request of [1] requests
14:13:47,376 TRACE [f.p.e.c.f.c.v.ElasticsearchClientV7] Executed bulk request with [1] requests
14:13:48,407 TRACE [f.p.e.c.f.t.TikaDocParser] Main detected language: [nl: HIGH (0.999996)]
14:13:48,415 TRACE [f.p.e.c.f.t.TikaDocParser] End document generation
14:13:48,415 TRACE [f.p.e.c.f.f.FsCrawlerUtil] No pattern always matches.
14:13:48,415 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(D:\DMS\DOCD2, D:\DMS\DOCD2\harold3.tif) = \harold3.tif
14:13:48,439 DEBUG [f.p.e.c.f.s.FsCrawlerDocumentServiceElasticsearchImpl] Indexing psf94038_data_v1/5438744b41121ee2d59cdd51b9c141f8?pipeline=null
14:13:48,439 TRACE [f.p.e.c.f.c.v.ElasticsearchClientV7] JSon indexed : {
"content" : "...........",
"meta" : {
"language" : "nl"
}
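With the same configuration, the logging from 2.9 shows that OCR no longer runs. Tesseract is first found at the configured path, but the later checks look for a bare tesseract.exe and fail: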
14:20:46,006 TRACE [f.p.e.c.f.t.TikaDocParser] Beginning Tika extraction
14:20:46,020 DEBUG [f.p.e.c.f.t.TikaInstance] OCR is activated.
14:20:46,027 DEBUG [f.p.e.c.f.t.TikaInstance] Tesseract Path set to [D:\opt\Tesseract-OCR].
14:20:46,027 DEBUG [f.p.e.c.f.t.TikaInstance] Tesseract Data Path set to [D:\opt\Tesseract-OCR\tessdata].
14:20:46,074 DEBUG [o.a.t.p.o.TesseractOCRParser] hasTesseract (path: [D:\opt\Tesseract-OCR\tesseract.exe]): true
14:20:46,075 DEBUG [f.p.e.c.f.t.TikaInstance] OCR strategy for PDF documents is [auto] and tesseract was found.
14:20:46,075 INFO [f.p.e.c.f.t.TikaInstance] OCR is enabled. This might slowdown the process.
14:20:46,423 DEBUG [o.a.t.p.o.TesseractOCRParser] hasTesseract (path: [tesseract.exe]): false
14:20:46,424 DEBUG [o.a.t.p.o.TesseractOCRParser] ImageMagick does not appear to be installed (commandline: magick)
14:20:46,571 DEBUG [o.a.t.p.o.TesseractOCRParser] hasTesseract (path: [tesseract.exe]): false
14:20:46,572 DEBUG [o.a.t.p.o.TesseractOCRParser] ImageMagick does not appear to be installed (commandline: magick)
14:20:47,516 TRACE [f.p.e.c.f.t.TikaDocParser] End of Tika extraction
14:20:48,937 TRACE [f.p.e.c.f.t.TikaDocParser] Main detected language: [: NONE (0.000000)]
14:20:48,945 TRACE [f.p.e.c.f.t.TikaDocParser] End document generation
14:20:48,945 TRACE [f.p.e.c.f.f.FsCrawlerUtil] Null or empty content always matches.
14:20:48,945 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(D:\DMS\DOCD2, D:\DMS\DOCD2\harold3.tif) = \harold3.tif
14:20:48,954 DEBUG [f.p.e.c.f.s.FsCrawlerDocumentServiceElasticsearchImpl] Indexing psf94038_data_v1/5438744b41121ee2d59cdd51b9c141f8?pipeline=null
14:20:48,955 TRACE [f.p.e.c.f.c.v.ElasticsearchClientV7] JSon indexed : {
"meta" : { },
}
After adding the Tesseract location to the system PATH variable, OCR works correctly again.
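For anyone who wants to confirm the workaround on their own machine, here is a minimal sketch that checks whether a bare executable name resolves through PATH. It only approximates what the "hasTesseract (path: [tesseract.exe])" log lines suggest is being checked. The helper name find_on_path is hypothetical and is not part of FSCrawler or Tika.

```python
import os
import shutil


def find_on_path(executable, extra_dir=None):
    """Return the full path of `executable` if it resolves via PATH, else None.

    extra_dir is a hypothetical convenience: when given, it is searched
    before PATH, which simulates having added that directory to PATH.
    """
    search_path = os.environ.get("PATH", "")
    if extra_dir:
        search_path = extra_dir + os.pathsep + search_path
    # shutil.which applies the platform's lookup rules (PATHEXT on Windows).
    return shutil.which(executable, path=search_path)


if __name__ == "__main__":
    # None here means a bare "tesseract" cannot be resolved from PATH.
    print("tesseract via PATH:", find_on_path("tesseract"))
```

If this prints None before the PATH change and a full path afterwards, that matches the behaviour described above, assuming the second lookup in 2.9 really does rely on PATH.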
Regards,
Harold