FSCrawler - Issue with OCR

Hi everyone,

I'm facing an issue with FSCrawler and OCR.
I have a small set of PDF files. Some contain "real" text and others contain scanned text.
When I activate the OCR module, I receive this message:

"[I regret that I couldn’t find an OCR parser to handle image/ocr-png. Please set the OCR_STRATEGY to NO_OCR or configure your OCR parser correctly]"

I am aware that another topic was opened on this subject, but it is closed and I wasn't able to find a solution to my issue.
I am using the following versions:

  • Elasticsearch/Kibana v7.11.2
  • FSCrawler es7-2.9
  • Tesseract-OCR v5.0.1 (2022-01-18)

I also tried the 2.10-SNAPSHOT version, with the same result.
I think I have configured the crawler correctly but, as I am not an IT expert, I have pasted the _settings.yaml file below:

name: "sample_pdf"

# required
fs:
  # define a "local" file path crawler, if running inside a docker container this must be the path INSIDE the container
  url: "\\\\xxxxxxxxxxxxxxxxx\\ElasticDataTest\\OldPDF"
  follow_symlink: false
  remove_deleted: true
  continue_on_error: false

  # how often to scan the url defined above for changes (here: every 24 hours)
  update_rate: "24h"

  # optional: define includes and excludes, "~" files are excluded by default if not defined below
  excludes:
  - "/~"

  # special handling of JSON files, should only be used if ALL files are JSON
  json_support: false
  add_as_inner_object: false

  # special handling of XML files, should only be used if ALL files are XML
  xml_support: false

  # use MD5 from filename (instead of filename) if set to false
  filename_as_id: false

  # include size of file in index
  add_filesize: true

  # include user/group of file only if needed
  attributes_support: false

  # do you REALLY want to store every file as a copy in the index ? Then set this to true
  store_source: false

  # you may want to store (partial) content of the file (see indexed_chars)
  index_content: true

  # how much data from the content of the file should be indexed (and stored inside the index), set to 0 if you need checksum, but no content at all to be indexed
  #indexed_chars: "0"
  indexed_chars: "100%"

  # usually file metadata will be stored in separate fields, if you want to keep the original set, set this to true
  raw_metadata: false

  # optional: add checksum meta (requires index_content to be set to true)
  checksum: "MD5"

  # recommended, but will create another index
  index_folders: true

  lang_detect: true

  ocr:
    enabled: true
    pdf_strategy: "ocr_and_text"
    language: "eng+fra"
    path: "C:\\Program Files\\Tesseract-OCR"
    data_path: "C:\\Program Files\\Tesseract-OCR\\tessdata"

# required
elasticsearch:
  nodes:
  # With URL
  - url: "http://xxxxxx:9200"
  bulk_size: 1000
  flush_interval: "5s"
  byte_size: "10mb"

Everything is running on an old Windows Server 2012 R2 server.

Could you try to add C:\\Program Files\\Tesseract-OCR to your PATH, remove the Tesseract configuration from the FSCrawler settings and restart?

Thanks for the answer.
I added the path (with "\" - not sure if that was the right thing to do) to the system path. I removed the 2 lines from the *.yaml file.
But now, when I launch FSCrawler, it says that OCR is disabled.
It may be my fault, as I am not an expert in adding a new path...

And, by the way, thanks a lot for this nice software. I hope it will help my research center manage tons of old documents.

If you added it to the system Path, you might have to reboot.
If it's the user Path, you need to restart the cmd shell.
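
If you want to check whether the shell you launch FSCrawler from can actually see Tesseract after the restart, something like the following might help. It is only a rough sketch and not part of FSCrawler: it assumes Python is installed on the server and that the executable is named "tesseract"; the function names find_tesseract and tesseract_languages are made up for this example.

```python
import shutil
import subprocess


def find_tesseract(search_path=None):
    """Return the full path of the tesseract executable, or None if not found.

    search_path=None means "use the PATH of the current process", which is
    the PATH that a program started from this same shell would inherit.
    """
    return shutil.which("tesseract", path=search_path)


def tesseract_languages(executable):
    """Return whatever `tesseract --list-langs` prints, as text."""
    result = subprocess.run(
        [executable, "--list-langs"],
        capture_output=True,
        text=True,
        check=False,
    )
    # Collect both streams so nothing is lost if the listing goes to stderr.
    return result.stdout + result.stderr


if __name__ == "__main__":
    exe = find_tesseract()
    if exe is None:
        print("tesseract is not on PATH in this shell; restart the shell or reboot")
    else:
        print("found:", exe)
        print(tesseract_languages(exe))
```

If this reports that tesseract is not on PATH, the shell is most likely still using the old environment. If it is found, the language listing should show whether "eng" and "fra" are installed.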

Sorry, I forgot to restart PowerShell...
It seems to work. OCR is now "enabled" and seems to be running (except for the "WARN ... no glyph for ..." messages).
Thanks again.

I also have another question about FSCrawler and the mapping process, but it may be better to create another topic?

Yes. That'd be better as I marked this one as solved. :wink:

Ok, I'll do that.