FSCrawler: perform OCR only on PDF files that do not have text

Hello,
I'm using FSCrawler (2.7) to load text from PDFs into Elasticsearch (7.6.x). Most of the PDF files have text, but some contain images of scanned text and need to be OCRed. Is there a way to configure FSCrawler so that it performs OCR only on PDF files that contain images of scanned text, and not on files that already have text?

So far I can configure it either to do no OCR on any files (case 1) or to do OCR on all files (case 2). In the first case, FSCrawler loads all files with text very quickly but skips all files with images of scanned text. In the second case, it takes a really long time because it OCRs all the files, including those that already have text.

Here are the OCR settings for FSCrawler: https://fscrawler.readthedocs.io/en/latest/user/ocr.html

Config for case 1:

name: "test"
fs:
  url: "/path/to/data/dir"
  ocr:
    enabled: false
    pdf_strategy: 'no_ocr'

Config for case 2:

name: "test"
fs:
  url: "/path/to/data/dir"
  ocr:
    enabled: true
    pdf_strategy: 'ocr_and_text'

P.S. I can sort the PDF files by other means into those that need OCR and those that do not, and run a separate FSCrawler job for each pile, but before I do this, I want to check whether there is an easier way using FSCrawler's native features.

Thank you!

Welcome!

I don't think there's a way to do that in Tika (and therefore in FSCrawler).
How do you know which files should be OCRed and which should not? Is there a technical way to implement that?

Thank you for your response, David!
This clarifies things perfectly!

My thinking was to do the following operation for every file:

  1. Extract the text from the file, then count the characters in the text and the pages in the file.
  2. If there are fewer than X characters per page, perform OCR on that file.
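
A minimal Python sketch of the two steps above. It assumes the per-page text has already been extracted by whatever PDF library is in use; the function name `needs_ocr` and the default threshold of 100 characters per page are hypothetical placeholders, not values from this thread.

```python
from typing import Sequence


def needs_ocr(page_texts: Sequence[str], min_chars_per_page: int = 100) -> bool:
    """Return True when the extracted text looks too sparse to be a real text layer.

    page_texts: text already extracted from each page, one string per page.
    min_chars_per_page: the "X" threshold; 100 is a placeholder, not a tuned value.
    """
    if not page_texts:
        # No pages could be read at all, so treat the file as needing OCR.
        return True
    # Count non-whitespace characters so blank padding does not hide a scanned page.
    total_chars = sum(len("".join(text.split())) for text in page_texts)
    return total_chars / len(page_texts) < min_chars_per_page
```

Files for which this returns True could be moved to the directory crawled by the OCR-enabled job, and the rest to the directory crawled by the job with OCR disabled.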

I was uploading PDFs to Elasticsearch using Python before, and this is how I did it. I'd like to use FSCrawler from now on, so let me explore whether I can solve it by sorting the files on whether they need to be OCRed and running two FSCrawler jobs.

P.S. Thank you so much for your work on FSCrawler!

Smart idea. Could you open an issue so I can try to come up with a solution?

For example, by adding an option under ocr like "run_above": 500.

We have a rudimentary "auto" mode for OCR'ing of PDFs. I just updated our wiki to include this -- https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=109454066. See "Option 2: Configuring OCR on Rendered Pages".

This will trigger OCR on a page if fewer than 10 characters were extracted from it or more than 10 characters lack Unicode mappings.
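
As a rough illustration only, the rule as stated above can be written as a small Python predicate. This is not Tika's actual code; the function and parameter names are hypothetical, and only the two thresholds of 10 come from the description above.

```python
def page_needs_ocr(extracted_chars: int, unmapped_chars: int,
                   min_extracted: int = 10, max_unmapped: int = 10) -> bool:
    """Per-page decision mirroring the rule described in this post.

    extracted_chars: number of characters extracted from the page.
    unmapped_chars: number of characters on the page lacking a Unicode mapping.
    """
    # OCR is triggered when too little text was extracted, or when too much
    # of it could not be mapped to Unicode.
    return extracted_chars < min_extracted or unmapped_chars > max_unmapped
```

Unlike a per-file characters-per-page average, this check is made page by page, so a mostly textual PDF with a few scanned pages would have only those pages OCRed.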

If there are better heuristics we should add, let us know!

This is amazing! Thanks a ton @tallison.

PR is on its way here: https://github.com/dadoonet/fscrawler/pull/965

@equj you can already use the auto option by setting:

name: "test"
fs:
  url: "/path/to/data/dir"
  ocr:
    enabled: true
    pdf_strategy: 'auto'