TL;DR: In a full Elasticsearch 9.2 stack running under Docker Compose, FSCrawler 2.10-SNAPSHOT throws TesseractOCRParser timeout errors on every PDF it tries to ingest; text files index fine.
Full details
I’m currently building a local Elasticsearch 9.2 cluster via Docker Compose (eventually to be deployed to Amazon ECS) as a custom document search engine. However, FSCrawler 2.10-SNAPSHOT keeps returning TesseractOCRParser timeout errors when trying to ingest PDFs.
I’ve got the entire stack running: three Elasticsearch nodes, a dedicated Kibana node, and a dedicated FSCrawler container. Everything is communicating correctly over the Docker Compose network (all the ES nodes know of each other, and Kibana sees them all), and FSCrawler can access the original files in its container’s bind mount. It correctly indexes text files’ contents; I’ve verified this by querying for known text-file contents against a data view built on FSCrawler’s index in Kibana’s Discover page, and I get the expected results.
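For context, here’s roughly what the FSCrawler service in my Compose file looks like. This is a trimmed sketch, not the real file (that’s in the Gist linked under “Relevant configs”); the service names, host paths, and image tag here are illustrative:

```yaml
# Trimmed sketch of the FSCrawler service; the full Compose file is in the Gist.
# Service names, host paths, and the image tag are illustrative.
fscrawler:
  image: dadoonet/fscrawler:2.10-SNAPSHOT   # official Docker Hub image
  container_name: fscrawler
  volumes:
    - ./fscrawler/config:/root/.fscrawler   # job settings (_settings.yaml)
    - ./documents:/tmp/es:ro                # bind mount with the source files
  depends_on:
    es01:
      condition: service_healthy            # illustrative; my ES nodes have healthchecks
  command: fscrawler resumes --restart
```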
However, every PDF comes back with (null) content. I can see the FSCrawler container’s CPU usage spike, and stay high for several minutes, when the job starts running on new PDF files. When the job ends and I look at the FSCrawler log, I see error (more accurately, WARN) messages, all of which read:
```
<REDACTED TIMESTAMP> WARN [f.p.e.c.f.t.TikaDocParser] Failed to extract [5000000] characters of text for [<REDACTED FILEPATH>]: Unable to extract PDF content -> Unable to end a page -> TesseractOCRParser timeout
```
- Note 1: I’ve already increased the indexed character limit to 5 million (a trimmed sketch of the relevant settings follows this list).
- Note 2: None of my test PDFs contain more than 2,500 non-whitespace characters.
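For reference, here’s roughly what the relevant part of my `_settings.yaml` looks like. Again a trimmed sketch, not the real file (that’s in the Gist linked below); the job name `resumes` is taken from the settings path in the Gist title, the node URL is illustrative, and the `ocr` values are FSCrawler’s documented defaults as I understand them:

```yaml
# Trimmed sketch of /root/.fscrawler/resumes/_settings.yaml; full file in the Gist.
name: "resumes"
fs:
  url: "/tmp/es"                   # where the bind-mounted documents live
  indexed_chars: 5000000           # raised from the default (see Note 1)
  ocr:
    enabled: true                  # default: OCR images and image-only PDF pages
    language: "eng"
    pdf_strategy: "ocr_and_text"   # default: extract embedded text AND run OCR
elasticsearch:
  nodes:
    - url: "https://es01:9200"     # illustrative; real node URLs are in the Gist
  # api_key is set after the cluster is up, per the Gist title
```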
I’m unclear as to why this containerized Java process would be timing out; as far as I can tell, Tika’s TesseractOCRParser defaults to a 120-second OCR timeout, which should be ample for files this small.
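In case it helps with diagnosis: per the FSCrawler docs as I understand them, OCR can be disabled per job, which should isolate whether Tesseract itself is the problem. A minimal sketch (setting names per the docs; I haven’t confirmed behavior on 2.10-SNAPSHOT):

```yaml
# Diagnostic sketch: take Tesseract out of the loop for this job.
fs:
  ocr:
    enabled: false            # skip OCR entirely
    # or, to keep OCR for standalone images but skip it for PDFs:
    # pdf_strategy: "no_ocr"
```

If plain (non-OCR) PDF text extraction then succeeds, the problem would be scoped to Tesseract inside the container.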
FWIW, I chose FSCrawler because I’d seen it recommended online for ingesting various document formats from a file system and automatically generating indices. It especially suits my use case that it handles PDFs out of the box, has a built-in OCR feature, and can be deployed from official Docker Hub images. It also seems to be under active development with responsive support.
I would’ve added fscrawler as a tag to this post, but it doesn’t exist and I don’t yet have enough trust to create one.
Thanks in advance for any help!
Relevant configs
- Docker Compose YAML: Full Docker Compose configuration for Elasticsearch + Kibana cluster with FSCrawler · GitHub
- FSCrawler _settings.yaml: FSCrawler settings YAML (container path /root/.fscrawler/resumes/; copied into container post-ES build to facilitate api_key setting) · GitHub
- .env file: Common .env file (contains all FSCrawler configs) · GitHub