Hello,
I'm using FSCrawler (2.7) to load text from PDFs into Elasticsearch (7.6.x). Most of the PDF files contain text, but some contain images of scanned text and need to be OCRed. Is there a way to configure FSCrawler so that it performs OCR only on PDF files that contain images of scanned text, and not on files that already have text?
So far I can configure it either to do no OCR on any files (case 1) or to do OCR on all files (case 2). In the first case, FSCrawler skips all files with images of scanned text but loads all files with text very quickly. In the second case, it takes a really long time because it OCRs every file, including those that already have text.
P.S. I could sort the files into OCRed and non-OCRed piles using other means and run two separate FSCrawler jobs, one per pile, but before I do that, I want to check whether there is an easier way using FSCrawler's native features.
I don't think there's a way to do that in Tika (and therefore not in FSCrawler).
How would you know which files should be OCRed and which shouldn't? Is there a technical way to implement that?
Thank you for your response, David!
This clarifies things perfectly!
My thinking was to do the following for every file:

1. Extract the text from the file and count how many characters it contains and how many pages the file has.
2. If there are fewer than X characters per page, perform OCR on that file (a rough sketch follows below).
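For anyone landing here later, here is a minimal sketch of that heuristic in Python. I'm assuming the `pypdf` library for page counting and text extraction; the threshold `MIN_CHARS_PER_PAGE` and the directory names are made up for illustration:

```python
from pathlib import Path
import shutil

from pypdf import PdfReader  # assumption: the pypdf library is installed

# Hypothetical threshold: a page averaging fewer extractable characters
# than this is treated as a scanned image that needs OCR.
MIN_CHARS_PER_PAGE = 100

def needs_ocr(pdf_path: Path) -> bool:
    """Return True when the PDF has too little extractable text per page."""
    reader = PdfReader(str(pdf_path))
    num_pages = len(reader.pages) or 1  # guard against division by zero
    num_chars = sum(len(page.extract_text() or "") for page in reader.pages)
    return num_chars / num_pages < MIN_CHARS_PER_PAGE

# Sort PDFs into two directories, one per FSCrawler job.
for bucket in ("needs_ocr", "has_text"):
    Path(bucket).mkdir(exist_ok=True)

for pdf in Path("inbox").glob("*.pdf"):
    target = "needs_ocr" if needs_ocr(pdf) else "has_text"
    shutil.move(str(pdf), str(Path(target) / pdf.name))
```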
I was uploading PDFs to Elasticsearch using Python before, and this is how I did it. I'd like to use FSCrawler from now on, so let me explore whether I can solve it by having two FSCrawler jobs and sorting files by whether they need to be OCRed.
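For the two-job approach, the settings could look roughly like this. FSCrawler supports a `pdf_ocr` flag under `fs` to toggle OCR for PDFs; the job names and paths below are made up:

```yaml
# ~/.fscrawler/pdfs_with_text/_settings.yaml — job for PDFs that already have text
name: "pdfs_with_text"
fs:
  url: "/data/pdfs/has_text"
  pdf_ocr: false
```

```yaml
# ~/.fscrawler/pdfs_needing_ocr/_settings.yaml — job for scanned PDFs
name: "pdfs_needing_ocr"
fs:
  url: "/data/pdfs/needs_ocr"
  pdf_ocr: true
```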
P.S. Thank you so much for your work on FSCrawler!