Hi @dadoonet,
I'm using FSCrawler to index my files into Elasticsearch, and I find that FSCrawler can't extract text from JPG images embedded in a PDF, even though pdf_strategy: "ocr_and_text" has been set. My OCR settings are listed below:
I have tried parsing and indexing a standalone JPG file and it works fine, which indicates that OCR itself is enabled.
I also noticed a warning message when I run FSCrawler from the command line:
10:43:14,998 WARN [o.a.t.p.PDFParser] J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.
Could this warning be responsible for the problem?
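In case it helps narrow this down, here is a rough sketch for checking whether a PDF contains JPEG2000 images at all, since the warning only concerns those. It is not part of FSCrawler or Tika; the function names are my own, and it relies on the assumption that image streams name their compression filter in plain bytes (/DCTDecode for ordinary JPEG, /JPXDecode for JPEG2000). It is a heuristic byte scan, not a real PDF parser.

```python
import re
from pathlib import Path

# Filter names used by PDF image streams:
# DCTDecode = baseline JPEG, JPXDecode = JPEG2000.
FILTERS = (b"DCTDecode", b"JPXDecode")


def count_image_filters(pdf_bytes: bytes) -> dict:
    """Count raw occurrences of each filter name in the PDF bytes.

    Heuristic only: names written with hex escapes would be missed.
    """
    return {
        name.decode("ascii"): len(re.findall(rb"/" + name + rb"\b", pdf_bytes))
        for name in FILTERS
    }


def report(path: str) -> str:
    """Return a one-line summary for the PDF at `path` (hypothetical helper)."""
    counts = count_image_filters(Path(path).read_bytes())
    return f"{path}: JPEG={counts['DCTDecode']} JPEG2000={counts['JPXDecode']}"
```

If report() shows JPEG2000=0 for the problem file, the J2KImageReader warning is probably unrelated to the images not being OCR'd.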
ES version: 7.3.0
FSCrawler: fscrawler-es7-2.7
OS: Win 10
I saw the doc you mentioned and gave it a try.
It seems that the link to the JPEG2000 support library in the doc is broken. I downloaded it from here and added it to the lib directory; however, it made no difference and the same warning still shows up.