Hello,
I'm using FSCrawler (2.7) to load text from PDFs into Elasticsearch (7.6.x). Most of the PDF files contain text, but some contain images of scanned text and need to be OCRed. Is there a way to configure FSCrawler so that it performs OCR only on PDF files that contain images of scanned text, and not on files that already have text?
So far I can configure it either to do no OCR on any files (case 1) or to do OCR on all files (case 2). In the first case, FSCrawler skips all files with images of scanned text but loads all files with text very quickly. In the second case, it takes a really long time because it OCRs every file, including those that already have text.
P.S. I could sort the files into OCRed and non-OCRed piles using other means and run two separate FSCrawler jobs, one per pile, but before I do that, I want to check whether there is an easier way using FSCrawler's native features.
I don't think there's a way to do that in Tika (and therefore not in FSCrawler).
How would you know which files should be OCRed and which shouldn't? Is there a technical way to implement that?
Thank you for your response, David!
This clarifies things perfectly!
My thinking was to do the following for every file:

1. Extract the text from the file and count how many characters it contains and how many pages the file has.
2. If there are fewer than X characters per page, perform OCR on that file (a rough sketch follows below).
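For anyone landing here later, here is a minimal sketch of that heuristic in Python. I'm assuming the `pypdf` library for page counting and text extraction; the threshold `MIN_CHARS_PER_PAGE` and the directory names are made up for illustration:

```python
from pathlib import Path
import shutil

from pypdf import PdfReader  # assumption: the pypdf library is installed

# Hypothetical threshold: a page averaging fewer extractable characters
# than this is treated as a scanned image that needs OCR.
MIN_CHARS_PER_PAGE = 100

def needs_ocr(pdf_path: Path) -> bool:
    """Return True when the PDF has too little extractable text per page."""
    reader = PdfReader(str(pdf_path))
    num_pages = len(reader.pages) or 1  # guard against division by zero
    num_chars = sum(len(page.extract_text() or "") for page in reader.pages)
    return num_chars / num_pages < MIN_CHARS_PER_PAGE

# Sort PDFs into two directories, one per FSCrawler job.
for bucket in ("needs_ocr", "has_text"):
    Path(bucket).mkdir(exist_ok=True)

for pdf in Path("inbox").glob("*.pdf"):
    target = "needs_ocr" if needs_ocr(pdf) else "has_text"
    shutil.move(str(pdf), str(Path(target) / pdf.name))
```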
I was uploading PDFs to Elasticsearch using Python before, and this is how I did it. I'd like to use FSCrawler from now on, so let me explore whether I can solve it by having two FSCrawler jobs and sorting files by whether they need to be OCRed.
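For the two-job approach, the settings could look roughly like this. FSCrawler supports a `pdf_ocr` flag under `fs` to toggle OCR for PDFs; the job names and paths below are made up:

```yaml
# ~/.fscrawler/pdfs_with_text/_settings.yaml — job for PDFs that already have text
name: "pdfs_with_text"
fs:
  url: "/data/pdfs/has_text"
  pdf_ocr: false
```

```yaml
# ~/.fscrawler/pdfs_needing_ocr/_settings.yaml — job for scanned PDFs
name: "pdfs_needing_ocr"
fs:
  url: "/data/pdfs/needs_ocr"
  pdf_ocr: true
```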
P.S. Thank you so much for your work on FSCrawler!