TL;DR: In a full ES v9.2 stack running under Docker Compose, FSCrawler v2.10-SNAPSHOT returns only TesseractOCRParser timeout errors when trying to ingest any PDFs.
Full details
I’m currently building a local Elasticsearch 9.2 cluster via Docker Compose (it will eventually be deployed to Amazon ECS) as a custom document search engine. However, FSCrawler 2.10-SNAPSHOT keeps returning TesseractOCRParser timeout errors when trying to ingest PDFs.
I’ve got an entire stack running: three Elasticsearch nodes, a dedicated Kibana node, and a dedicated FSCrawler container. Everything is communicating correctly on the Docker Compose backplane (all the ES nodes know of each other & Kibana sees them all). FSCrawler can access the original files in its container’s bind mount. It correctly indexes text files’ contents. I’ve verified this by running queries for the text files’ contents on a data view based on FSCrawler’s index under Kibana’s Discover page. I get expected results.
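For reference, the FSCrawler service in my compose file looks roughly like this (image tag, job name, service names, and host paths are simplified placeholders, not my exact values):

```yaml
  fscrawler:
    image: dadoonet/fscrawler:2.10-SNAPSHOT
    volumes:
      # Job settings (_settings.yaml) live under ./config/<job_name>/
      - ./config:/root/.fscrawler
      # The documents to crawl, mounted read-only
      - ./documents:/tmp/es:ro
    depends_on:
      - es01
    command: fscrawler my_job
```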
However, any PDFs come up as (null) contents. I can see the CPU usage of the FSCrawler container spike — and remain high for several minutes — when it starts running the job on new PDF files. When the job ends and I look at the FSCrawler log, I see error (more accurately, WARN) messages, all of which read:
<REDACTED TIMESTAMP> WARN [f.p.e.c.f.t.TikaDocParser] Failed to extract [5000000] characters of text for <REDACTED FILEPATH>]: Unable to extract PDF content -> Unable to end a page -> TesseractOCRParser timeout
Note 1: I’ve already increased the indexed character limit to 5 million (see the settings snippet after these notes).
Note 2: None of my test PDFs are more than 2,500 non-whitespace characters.
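The relevant part of my job’s _settings.yaml is roughly this (the job name and Elasticsearch URL are placeholders; the indexed_chars value is the 5 million from Note 1):

```yaml
name: "my_job"
fs:
  url: "/tmp/es"
  # Raised from the default so larger documents are indexed in full
  indexed_chars: 5000000
elasticsearch:
  nodes:
    - url: "https://es01:9200"
```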
I’m unclear as to why this containerized Java binary would be timing out.
FWIW, I’d chosen FSCrawler because I’d seen it in recommendations online for ingesting various document formats from a file system and then automatically generating indices. It especially helps my use case that it handles PDFs off the shelf, has a built-in OCR feature, and can be deployed from official Docker Hub images. It also seems to be under active development with active support.
I would’ve added fscrawler as a tag to this post, but it doesn’t exist & I don’t yet have enough trust to create one.
Also, maybe add more memory to the FSCrawler process:
FS_JAVA_OPTS="-Xmx2048m -Xms2048m"
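With Docker Compose, that can go in the FSCrawler service’s environment, something like:

```yaml
  fscrawler:
    environment:
      # Give the FSCrawler JVM a 2 GB heap
      - FS_JAVA_OPTS=-Xmx2048m -Xms2048m
```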
Well, that makes it worse. The lower the limit, the better, but then of course the full document won’t be sent to Elasticsearch; only the first X characters will.
But do they have a lot of images? By default, FSCrawler tries to run OCR on any images it detects.
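To check whether OCR is the problem, you could disable it for the job and re-run, with something like this in _settings.yaml:

```yaml
fs:
  ocr:
    # Turn OCR off entirely to see if the timeout goes away
    enabled: false
    # or keep OCR for images but skip it for PDFs:
    # pdf_strategy: "no_ocr"
```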
Ideally, if you could share one of the documents you are trying to parse, that would help. Feel free to send me a direct message if you don’t want to share it here or on GitHub.
Are you running with a trial license or a commercial license?
Could you share the full logs?
Well, FSCrawler is not an official project provided by Elastic, so we did not add this tag.
@dadoonet, thank you for your rapid response! I’ve found FSCrawler to be matching my team’s needs pretty perfectly. Thank you also for continuing to develop it!
Let me try the change in official Docker images (I recall looking at both image tags’ CRC; thought they were the same) and also the Java memory options. I’ll report back shortly.
A slight tangent… I hadn’t checked; is the official FSCrawler image using Java 8 or a more recent release?
Fun note: I realized from your post that I’m still pushing an alias named `*-es7`, which should not exist. I’m going to remove it from future Docker pushes.
Let me try the change in official Docker images (I recall looking at both image tags’ CRC; thought they were the same)
Well, that’s true. It should not be there, that’s all.
Is the official FSCrawler image using Java 8
The image is using `eclipse-temurin:25-jdk`.
But I’m compiling using Java 11 as a target. I should definitely use at least Java 17.
I should have responded with this before: Thank you!
I’ve applied these changes, and the timeout warn/error is regrettably still occurring.
They have no images; they’re toy example Word docs that I saved as PDFs.
I’m running with the basic license.
Absolutely! The latest is below. Note that there are a couple of larger PDFs mentioned in the log; it was their initial failure that led me to make the toy examples above.