Hi everyone,
I'm facing an issue with FSCrawler and OCR.
I have a small set of PDF files. Some contain "real" text and others contain scanned text.
When I activate the OCR module, I get this message:
" [I regret that I couldn’t find an OCR parser to handle image/ocr-png.Please set the OCR_STRATEGY to NO_OCR or configure yourOCR parser correctly]"
I am aware that another topic was opened on this subject, but it is closed and I wasn't able to find a solution to my issue there.
I am using the following versions:
- Elasticsearch/Kibana v7.11.2
- FSCrawler es7-2.9
- Tesseract-OCR v5.0.1 (2022-01-18)
I also tried with the 2.10-SNAPSHOT version, with the same result.
I think I have configured the crawler correctly, but as I am not an IT expert, I am pasting the _settings.yaml file below:
name: "sample_pdf"
# required
fs:
  # define a "local" file path crawler, if running inside a docker container this must be the path INSIDE the container
  url: "\\\\xxxxxxxxxxxxxxxxx\\ElasticDataTest\\OldPDF"
  follow_symlink: false
  remove_deleted: true
  continue_on_error: false
  # scan every 24 hours for changes in url defined above
  update_rate: "24h"
  # optional: define includes and excludes, "~" files are excluded by default if not defined below
  excludes:
  - "/~"
  # special handling of JSON files, should only be used if ALL files are JSON
  json_support: false
  add_as_inner_object: false
  # special handling of XML files, should only be used if ALL files are XML
  xml_support: false
  # use MD5 from filename (instead of filename) if set to false
  filename_as_id: false
  # include size of file in index
  add_filesize: true
  # include user/group of file only if needed
  attributes_support: false
  # do you REALLY want to store every file as a copy in the index ? Then set this to true
  store_source: false
  # you may want to store (partial) content of the file (see indexed_chars)
  index_content: true
  # how much data from the content of the file should be indexed (and stored inside the index), set to 0 if you need checksum, but no content at all to be indexed
  #indexed_chars: "0"
  indexed_chars: "100%"
  # usually file metadata will be stored in separate fields, if you want to keep the original set, set this to true
  raw_metadata: false
  # optional: add checksum meta (requires index_content to be set to true)
  checksum: "MD5"
  # recommended, but will create another index
  index_folders: true
  lang_detect: true
  ocr:
    enabled: true
    pdf_strategy: "ocr_and_text"
    language: "eng+fra"
    path: "C:\\Program Files\\Tesseract-OCR"
    data_path: "C:\\Program Files\\Tesseract-OCR\\tessdata"
# required
elasticsearch:
  nodes:
  # With URL
  - url: "http://xxxxxx:9200"
  bulk_size: 1000
  flush_interval: "5s"
  byte_size: "10mb"
Everything is running on an old Windows Server 2012 R2 server.