I am using Tesseract OCR with FSCrawler to extract text from PDFs. I have tried configuring _settings.yaml both with and without the Tesseract path and data_path settings. Either way, when I look at the content of the OCR-ed PDFs in Kibana, I only get newlines or empty space.
I also tested JPGs and PNGs to see whether Tesseract can OCR those files, but FSCrawler does not seem to read them at all: no content is extracted from either format.
In Python, I ran pytesseract on a JPG containing the same image as the PDFs I fed to FSCrawler, and it picked up the text, so Tesseract itself appears to work.
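For the case where path is left out of _settings.yaml, I am assuming FSCrawler has to find the tesseract executable on the PATH. A rough sketch of how that can be checked from the same command prompt is below. The function name find_tesseract and its which parameter are made up for illustration; shutil.which and the tesseract --list-langs flag are the only real pieces it relies on.

```python
import shutil
import subprocess
from typing import Callable, Optional


def find_tesseract(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return the full path of the tesseract executable on PATH, or None.

    `which` is injectable (hypothetical parameter) so the lookup can be
    tested without Tesseract installed.
    """
    return which("tesseract")


if __name__ == "__main__":
    exe = find_tesseract()
    if exe is None:
        print("tesseract is not on PATH")
    else:
        print("tesseract found at", exe)
        # --list-langs prints the language data files tesseract can see,
        # which helps confirm the tessdata directory is being picked up.
        result = subprocess.run(
            [exe, "--list-langs"], capture_output=True, text=True
        )
        print(result.stdout or result.stderr)
```

If this prints "tesseract is not on PATH", then the run without path in the settings could not have used OCR, regardless of what the logs show.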
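For reference, the pytesseract check was roughly along these lines. This is a minimal sketch rather than the exact script: the file name sample.jpg is a placeholder, and the ocr parameter is a hypothetical hook added only so the helper can be exercised without Tesseract installed.

```python
from typing import Callable, Optional


def ocr_has_text(
    image_path: str,
    ocr: Optional[Callable[[str], str]] = None,
) -> bool:
    """Return True if OCR on image_path yields any non-whitespace text."""
    if ocr is None:
        # Imported lazily so the helper can be tested with a stand-in OCR
        # function on machines without pytesseract or Pillow.
        import pytesseract
        from PIL import Image

        def ocr(path: str) -> str:
            with Image.open(path) as img:
                return pytesseract.image_to_string(img)

    # Only newlines or spaces count as "no text", which matches what
    # shows up in the content field in Kibana.
    return bool(ocr(image_path).strip())


if __name__ == "__main__":
    import os

    sample = "sample.jpg"  # placeholder file name
    if os.path.exists(sample):
        print(ocr_has_text(sample))
```

With the real image this returns True for me, which is why I suspect the problem is on the FSCrawler side rather than in Tesseract.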
My questions are:
Is Tesseract OCR simply not working with FSCrawler, and how can I tell? (The command prompt logs look fine; there is no error saying that OCR will not be performed.)
Do I need to convert the PDFs to JPGs first for Tesseract to work? If so, what should I do about FSCrawler not reading JPGs?
Thanks for your help!