Unable to extract PDF content

I'm trying to extract text from a PDF, but I get the following error:

Unable to extract PDF content -> Unable to end a page -> I regret that I couldn't find an OCR parser to handle image/ocr-png.Please set the OCR_STRATEGY to NO_OCR or configure yourOCR parser correctly

My configuration looks like this:

---
name: "ocr_docs"
fs:
  url: "C:\\Users\\Documents\\docs2ocr"
  update_rate: "15m"
  excludes:
  - "*\\~*"
  json_support: false
  filename_as_id: false
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false
  index_content: true
  attributes_support: true
  raw_metadata: true
  xml_support: false
  index_folders: true
  lang_detect: false
  continue_on_error: false
  ocr:
    language: "eng"
    enabled: true
    pdf_strategy: "ocr_and_text"
    path: "C:\\Program Files\\https%3a%2f%2fmirrors.163.com%2fcygwin%2f\\x86_64\\release\\tesseract-ocr\\tesseract-ocr-5.3.3-1\\usr\\bin"
    data_path: "C:\\Program Files\\https%3a%2f%2fmirrors.163.com%2fcygwin%2f\\x86_64\\release\\tesseract-ocr\\tesseract-ocr-5.3.3-1\\usr\\share\\tessdata"
  follow_symlinks: false
elasticsearch:
  nodes:
  - cloud_id: "cloud_id"
  username: "username"
  password: "password"
  bulk_size: 100
  flush_interval: "5s"
  byte_size: "10mb"

I'm using fscrawler-distribution-2.10-20231023.160816-291

Welcome!

I think you should find a more recent build than "20231023".
The latest I can see is fscrawler-distribution-2.10-20240213.145447-315.zip.

But I don't think that will fix your problem. Could you share your PDF document so I can try it out? If you can't share it publicly, you can send me a private message.

Thank you! I sent you a DM with the PDF document. Please let me know if you need more information.

So I tried your document and was able to get its content. Something like *** 18 de septiembre de 2013 *** (skipping the rest)...

Maybe this does not work on Windows, or maybe it was caused by an old build of FSCrawler?

Could you try with the most recent build?
If this does not work, could you try with the FSCrawler Docker images?
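
The error says no usable OCR parser was found, so it may also be worth checking that the directories set in fs.ocr.path and fs.ocr.data_path really contain a Tesseract executable and the trained data for the configured language. Here is a minimal Python sketch of that check. The helper name check_ocr_setup and the example paths are hypothetical, and I'm assuming the executable is named tesseract.exe (or tesseract) and that language data lives in files named <lang>.traineddata, as in a standard Tesseract install. It only inspects the file system and does not reproduce how FSCrawler or Tika look for Tesseract.

```python
from pathlib import Path


def check_ocr_setup(ocr_path, data_path, language="eng"):
    """Return a list of problems found; an empty list means nothing obvious is wrong.

    Hypothetical helper: it only checks that the expected files exist on disk.
    """
    problems = []
    bin_dir = Path(ocr_path)
    data_dir = Path(data_path)

    # The binary directory should exist and hold the Tesseract executable.
    if not bin_dir.is_dir():
        problems.append(f"ocr.path is not a directory: {bin_dir}")
    elif not any((bin_dir / name).is_file() for name in ("tesseract.exe", "tesseract")):
        problems.append(f"no tesseract executable found in: {bin_dir}")

    # The data directory should hold one .traineddata file per language.
    # Assumption: several languages may be joined with '+', e.g. "eng+spa".
    if not data_dir.is_dir():
        problems.append(f"ocr.data_path is not a directory: {data_dir}")
    else:
        for lang in language.split("+"):
            if not (data_dir / f"{lang}.traineddata").is_file():
                problems.append(f"missing {lang}.traineddata in: {data_dir}")

    return problems


if __name__ == "__main__":
    # Placeholder paths: replace them with the values from your _settings.yaml.
    issues = check_ocr_setup(r"C:\path\to\tesseract\bin", r"C:\path\to\tessdata", "eng")
    for issue in issues:
        print(issue)
    if not issues:
        print("OCR paths look plausible")
```

If this reports a missing executable or missing trained data, the problem is more likely the local Tesseract install than the FSCrawler build.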

I tried that build and it didn't work. However, it worked after switching to the FSCrawler Docker images. Thanks for the help!
