Unable to extract PDF content

NitzaAg · February 29, 2024, 9:15pm

I'm trying to extract text from a PDF, but I get the following error:

Unable to extract PDF content -> Unable to end a page -> I regret that I couldn't find an OCR parser to handle image/ocr-png.Please set the OCR_STRATEGY to NO_OCR or configure yourOCR parser correctly

My configuration looks like this:

---
name: "ocr_docs"
fs:
  url: "C:\\Users\\Documents\\docs2ocr"
  update_rate: "15m"
  excludes:
  - "*\\~*"
  json_support: false
  filename_as_id: false
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false
  index_content: true
  attributes_support: true
  raw_metadata: true
  xml_support: false
  index_folders: true
  lang_detect: false
  continue_on_error: false
  ocr:
    language: "eng"
    enabled: true
    pdf_strategy: "ocr_and_text"
    path: "C:\\Program Files\\https%3a%2f%2fmirrors.163.com%2fcygwin%2f\\x86_64\\release\\tesseract-ocr\\tesseract-ocr-5.3.3-1\\usr\\bin"
    data_path: "C:\\Program Files\\https%3a%2f%2fmirrors.163.com%2fcygwin%2f\\x86_64\\release\\tesseract-ocr\\tesseract-ocr-5.3.3-1\\usr\\share\\tessdata"
  follow_symlinks: false
elasticsearch:
  nodes:
  - cloud_id: "cloud_id"
  username: "username"
  password: "password"
  bulk_size: 100
  flush_interval: "5s"
  byte_size: "10mb"

I'm using fscrawler-distribution-2.10-20231023.160816-291
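
The error message names two ways out: set the OCR strategy to NO_OCR, or fix the OCR parser setup. As a minimal sketch of the first option, the fragment below changes only the `pdf_strategy` line of the `ocr` block from the configuration above. The value `no_ocr` is an assumption about how FSCrawler spells Tika's NO_OCR, so check it against the FSCrawler OCR documentation before relying on it. With OCR off for PDFs, text that exists only inside scanned images would presumably not be indexed, so this is more useful for confirming that the failure comes from the Tesseract setup than as a fix.

```yaml
# Sketch only: same job settings as above, with OCR on PDFs switched off.
# "no_ocr" is an assumed value; verify it in the FSCrawler documentation.
name: "ocr_docs"
fs:
  url: "C:\\Users\\Documents\\docs2ocr"
  ocr:
    language: "eng"
    enabled: true
    pdf_strategy: "no_ocr"   # was "ocr_and_text"
```

If the crawl succeeds with this change, the remaining suspects are the `path` and `data_path` values, which have to point at a working Tesseract installation.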

dadoonet · February 29, 2024, 11:10pm

Welcome!

I think you should find a more recent build than "20231023".
The latest I can see is fscrawler-distribution-2.10-20240213.145447-315.zip.

But I don't think that will fix your problem. Could you share your PDF document so I can try it out? If you can't share it publicly, you can send me a private message.

NitzaAg · March 1, 2024, 4:43pm

Thank you! I sent you a DM with the PDF document. Please let me know if you need more information.

dadoonet · March 3, 2024, 12:33pm

So I tried your document and was able to get its content. Something like *** 18 de septiembre de 2013 *** (skipping the rest)...

Maybe this does not work on Windows, or maybe it was caused by an old build of FSCrawler?

Could you try with the most recent build?
If this does not work, could you try with the FSCrawler Docker images?

NitzaAg · March 12, 2024, 10:41pm

I tried that build and it didn't work. However, it worked after switching to the FSCrawler Docker images. Thanks for the help!
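
The thread does not show how the Docker image was run, so the following is only a rough sketch of what such a setup might look like with Docker Compose. The image name `dadoonet/fscrawler` and the container paths `/root/.fscrawler` and `/tmp/es` follow what the FSCrawler Docker documentation describes as far as I recall; the `./config` host directory is hypothetical, and none of this is taken from the original poster's setup.

```yaml
# Hypothetical docker-compose.yml; host paths and the unpinned tag are assumptions.
services:
  fscrawler:
    image: dadoonet/fscrawler        # pin a specific tag in practice
    volumes:
      # Job settings, expected at ./config/ocr_docs/_settings.yaml on the host
      - "./config:/root/.fscrawler"
      # Documents to crawl, mounted read-only inside the container
      - "C:/Users/Documents/docs2ocr:/tmp/es:ro"
    command: fscrawler ocr_docs      # job name from the configuration above
```

In a setup like this, `fs.url` in the job settings would need to point at the container path (`/tmp/es` here) instead of the Windows folder, and the Windows-specific `ocr.path` and `ocr.data_path` entries would likely have to be removed so that the Tesseract installation inside the image is used.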

system · April 14, 2024, 9:17pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
