I'm facing an issue with FSCrawler and OCR:
I have a small set of PDF files. Some contain "real" text and others contain scanned text.
When I activate the OCR module, I receive this message:
" [I regret that I couldn’t find an OCR parser to handle image/ocr-png. Please set the OCR_STRATEGY to NO_OCR or configure your OCR parser correctly]"
I am aware that another topic was opened on this subject, but it is closed and I wasn't able to find a solution to my issue there.
I am using the following versions:
- Elasticsearch/Kibana v7.11.2
- FSCrawler es7-2.9
- Tesseract-OCR v5.0.1 (2022-01-18)
I also tried the 2.10-SNAPSHOT version, with the same result.
I think I have configured the crawler correctly, but as I am not an IT expert, I am pasting the _settings.yaml file below:
name: "sample_pdf"

# required
fs:

  # define a "local" file path crawler, if running inside a docker container this must be the path INSIDE the container
  url: "\\\\xxxxxxxxxxxxxxxxx\\ElasticDataTest\\OldPDF"
  follow_symlink: false
  remove_deleted: true
  continue_on_error: false

  # scan every 5 minutes for changes in url defined above
  update_rate: "24h"

  # optional: define includes and excludes, "~" files are excluded by default if not defined below
  excludes:
  - "/~"

  # special handling of JSON files, should only be used if ALL files are JSON
  json_support: false
  add_as_inner_object: false

  # special handling of XML files, should only be used if ALL files are XML
  xml_support: false

  # use MD5 from filename (instead of filename) if set to false
  filename_as_id: false

  # include size of file in index
  add_filesize: true

  # include user/group of file only if needed
  attributes_support: false

  # do you REALLY want to store every file as a copy in the index ? Then set this to true
  store_source: false

  # you may want to store (partial) content of the file (see indexed_chars)
  index_content: true

  # how much data from the content of the file should be indexed (and stored inside the index), set to 0 if you need checksum, but no content at all to be indexed
  #indexed_chars: "0"
  indexed_chars: "100%"

  # usually file metadata will be stored in separate fields, if you want to keep the original set, set this to true
  raw_metadata: false

  # optional: add checksum meta (requires index_content to be set to true)
  checksum: "MD5"

  # recommended, but will create another index
  index_folders: true

  lang_detect: true

  ocr:
    enabled: true
    pdf_strategy: "ocr_and_text"
    language: "eng+fra"
    path: "C:\\Program Files\\Tesseract-OCR"
    data_path: "C:\\Program Files\\Tesseract-OCR\\tessdata"

# required
elasticsearch:
  nodes:
  # With URL
  - url: "http://xxxxxx:9200"
  bulk_size: 1000
  flush_interval: "5s"
  byte_size: "10mb"
Everything is running on an old Windows Server 2012 R2 machine.
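To rule out a simple path problem with the `ocr` values in the settings above, a check along the following lines could be run on the server. This is only a minimal sketch: the function name `missing_tesseract_files` is hypothetical, and it assumes that the `path` folder should contain `tesseract.exe` and that each language in `eng+fra` needs a matching `.traineddata` file in `data_path`. It only confirms that the files exist; it does not prove that FSCrawler can actually launch Tesseract.

```python
from pathlib import Path


def missing_tesseract_files(path, data_path, language):
    """Return the files the OCR settings appear to rely on but that are absent.

    Hypothetical helper: path, data_path and language mirror the keys of the
    ocr section in _settings.yaml.
    """
    missing = []
    install_dir = Path(path)
    # Accept either executable name so the check also works outside Windows.
    if not any((install_dir / exe).is_file() for exe in ("tesseract.exe", "tesseract")):
        missing.append(str(install_dir / "tesseract.exe"))
    # "eng+fra" is assumed to mean two language packs:
    # eng.traineddata and fra.traineddata.
    for lang in language.split("+"):
        trained = Path(data_path) / f"{lang}.traineddata"
        if not trained.is_file():
            missing.append(str(trained))
    return missing


if __name__ == "__main__":
    # Values copied from the ocr section of the settings file
    # (raw strings, so the backslashes are not doubled here).
    problems = missing_tesseract_files(
        r"C:\Program Files\Tesseract-OCR",
        r"C:\Program Files\Tesseract-OCR\tessdata",
        "eng+fra",
    )
    print("missing:", problems if problems else "nothing")
```

If the script reports missing files, the installation or the two paths are the first thing to correct. If it reports nothing, the files are in place and the cause probably lies elsewhere.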