FSCrawler 2.10-SNAPSHOT not indexing PDF content

TL;DR: In a full Elasticsearch v9.2 stack running under Docker Compose, FSCrawler v2.10-SNAPSHOT returns only TesseractOCRParser timeout errors when trying to ingest any PDFs.

Full details

I’m currently building a local Elasticsearch 9.2 cluster via Docker Compose (it will eventually be deployed to Amazon ECS) as a custom document search engine. However, FSCrawler 2.10-SNAPSHOT keeps returning TesseractOCRParser timeout errors when trying to ingest PDFs.

I’ve got the entire stack running: three Elasticsearch nodes, a dedicated Kibana node, and a dedicated FSCrawler container. Everything communicates correctly on the Docker Compose network (all the ES nodes know about each other, and Kibana sees them all). FSCrawler can access the original files through its container’s bind mount, and it correctly indexes the contents of text files. I’ve verified this by querying for the text files’ contents in a data view based on FSCrawler’s index on Kibana’s Discover page, and I get the expected results.
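
For reference, the same sanity check can be run directly against Elasticsearch; a minimal sketch, with the index name and password as placeholders for my actual values:

curl -s -u elastic:<password> "localhost:9200/<fscrawler-index>/_search" \
  -H "Content-Type: application/json" \
  -d '{"query": {"match": {"content": "a phrase from one of the text files"}}}' | jq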

However, any PDFs come up with (null) content. I can see the CPU usage of the FSCrawler container spike (and remain high for several minutes) when it starts running the job on new PDF files. When the job ends and I look at the FSCrawler log, I see error (more accurately, WARN) messages, all of which read:

<REDACTED TIMESTAMP> WARN [f.p.e.c.f.t.TikaDocParser] Failed to extract [5000000] characters of text for [<REDACTED FILEPATH>]: Unable to extract PDF content -> Unable to end a page -> TesseractOCRParser timeout

  • Note 1: I’ve already increased the indexed character limit to 5 million (see the settings sketch below).
  • Note 2: None of my test PDFs contain more than 2,500 non-whitespace characters.
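
For reference, the character limit lives under `fs.indexed_chars` in the job’s `_settings.yaml`; a minimal sketch (the job name `resumes` comes from the logs further down, and the URL is the container-side path used later in the thread):

name: "resumes"
fs:
  url: "/tmp/es"
  indexed_chars: 5000000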

I’m unclear why this containerized Java process would be timing out.
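
As a quick diagnostic, one can at least confirm that the Tesseract binary runs inside the container (a sketch; it assumes Tesseract is bundled in the image, which the timeout error suggests, and that the container is named fscrawler):

docker exec fscrawler tesseract --version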

FWIW, I’d chosen FSCrawler because I’d seen it recommended online for ingesting various document formats from a file system and automatically generating indices. It especially helps my use case that it handles PDFs out of the box, has a built-in OCR feature, and can be deployed from official Docker Hub images. It also seems to be under active development with responsive support.

I would’ve added fscrawler as a tag to this post, but it doesn’t exist & I don’t yet have enough trust to create one. :sweat_smile:

All help is appreciated in advance!

Relevant configs

Welcome!

In your `.env` file, I saw:

FSCRAWLER_VERSION=2.10-SNAPSHOT-ocr-es7

It’s meant for Elasticsearch 7.

You should just use:

FSCRAWLER_VERSION=2.10-SNAPSHOT

Also, maybe add more memory to the FSCrawler process:

FS_JAVA_OPTS="-Xmx2048m -Xms2048m"
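
In the Docker setup, that option is passed through the FSCrawler service’s environment; a sketch matching the compose file shared further down in this thread:

  fscrawler:
    environment:
      - FS_JAVA_OPTS=-Xmx2048m -Xms2048m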

Well, that makes things worse. The lower the limit, the better here, but then of course the full document won’t be sent to Elasticsearch, as only the first X characters will be indexed.

But do they have a lot of images? By default, if images are detected, FSCrawler tries to run OCR on them.
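
If OCR turns out to be the bottleneck, it can also be tuned or disabled per job in `_settings.yaml`; a sketch (the job name and values are illustrative):

name: "my_job"
fs:
  ocr:
    # rely on the embedded text layer only and skip OCR for PDFs
    pdf_strategy: "no_ocr"
    # or turn OCR off entirely:
    # enabled: false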

Ideally, if you could share one of the documents you are trying to parse, that would help. Feel free to send me a direct message if you don’t want to share the document here or on GitHub.

Are you running with a trial license or a commercial license?

Could you share the full logs?

Well, FSCrawler is not an official project provided by Elastic, so we did not add this tag :wink:


@dadoonet, thank you for your rapid response! I’ve found FSCrawler to match my team’s needs almost perfectly. Thank you also for continuing to develop it!

Let me try the official Docker image change (I recall looking at both image tags’ CRCs and thinking they were the same), and also the Java memory options. I’ll report back shortly.

A slight tangent… I hadn’t checked; is the official FSCrawler image using Java 8 or a more recent release?

Fun note: your post made me realize that I’m still pushing an alias named `*-es7`, which should not exist. I’m going to remove it from future Docker pushes :slight_smile:

Let me try the official Docker image change (I recall looking at both image tags’ CRCs and thinking they were the same)

Well, that’s true. It should not be there, that’s all :smiley:

Is the official FSCrawler image using Java 8

The image is using `eclipse-temurin:25-jdk`.

But I’m compiling using Java 11 as a target. I should definitely use at least Java 17.
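
If you want to double-check the runtime yourself, something like this should print it (assuming the container is named fscrawler, as in the compose file later in this thread):

docker exec fscrawler java -version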


I should have responded with this before: Thank you! :person_bowing:

I’ve applied these changes, and the timeout warn/error is regrettably still occurring.

They have no images; they’re toy example Word docs that I saved as PDFs.

I’m running with the basic license.

Absolutely! The latest is below. Note that there are a couple of larger PDFs mentioned in the log; it was their initial failure that led me to make the toy examples above.

Thanks again for the support, @dadoonet!

By way of a quick update, I just tried to re-index my PDFs with the default character limit of 100K and the 2 GB Java heap configuration. It still failed, although more quickly:

<TIMESTAMP REDACTED> INFO [f.p.e.c.f.FsParserAbstract] Run #1: job [resumes]: starting...

<TIMESTAMP REDACTED> WARN [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [<PATH REDACTED>/Wlodarski_dummy_supplemental_file.pdf]: Unable to extract PDF content -> Unable to end a page -> TesseractOCRParser timeout

<TIMESTAMP REDACTED> WARN [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [<PATH REDACTED>/Wlodarski_dummy_cv_file.pdf]: Unable to extract PDF content -> Unable to end a page -> TesseractOCRParser timeout

<TIMESTAMP REDACTED> INFO [f.p.e.c.f.FsParserAbstract] Run #1: job [resumes]: indexed [4], deleted [0], documents up to [2026-01-21T12:26:36.630385167]. Started at [2026-01-21T12:26:38.630385167], finished at [2026-01-21T12:30:49.820277737], took [PT4M11.18989257S]. Will restart at [2026-01-21T12:31:36.630385167].

I tried to reproduce the problem and could not: I was able to ingest both documents on my side with a similar (but simplified) setup.

I will share my setup when possible.

What type of processor do you have?

Durn… :confused:

That will be terrific! Thank you.

I’ve been developing it on two platforms:

  • Locally, on an old Intel Core i7-7700K with 64 GiB of RAM, running Ubuntu 24
  • Remotely, on a pretty large Amazon Web Services EC2 instance running Ubuntu 20

Just had a thought: has PDF reading ever been tested over an NFS connection? In both cases, my PDFs are mounted to the file system over the network.

That could explain the difference in behavior. Could you try with a local folder so we can narrow down the error?
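
For example, a minimal way to narrow it down (a sketch; the NFS mount point is hypothetical, and the container path matches the bind mount in the compose file below): copy the PDFs onto local disk and point the FSCrawler volume there instead of the network mount.

cp -r /mnt/nfs/docs ./docs-local        # hypothetical NFS mount point
# then, for the fscrawler service in docker-compose.yml:
#   volumes:
#     - ${PWD}/docs-local:/tmp/es:ro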

Looking into it now.

Here is my setup:

The .env file:

# Password for the 'elastic' user (at least 6 characters)
ES_LOCAL_PASSWORD=changeme

# Version of Elastic products
ES_LOCAL_VERSION=9.2.4

# Set the ES container name
ES_LOCAL_CONTAINER_NAME=es-fscrawler

# Set to 'basic' or 'trial' to automatically start the 30-day trial
ES_LOCAL_LICENSE=basic
#ES_LOCAL_LICENSE=trial

# Port to expose Elasticsearch HTTP API to the host
ES_LOCAL_PORT=9200
ES_LOCAL_DISK_SPACE_REQUIRED=1gb
ES_LOCAL_JAVA_OPTS="-XX:UseSVE=0 -Xms128m -Xmx2g"

# Project namespace (defaults to the current folder name if not set)
COMPOSE_PROJECT_NAME=fscrawler

# FSCrawler Settings
FSCRAWLER_VERSION=2.10-SNAPSHOT
FSCRAWLER_PORT=8080

# Optionally, you can change the log level settings
FS_JAVA_OPTS="-DLOG_LEVEL=debug -DDOC_LEVEL=debug"

The docker-compose.yml file:

---
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:${ES_LOCAL_VERSION}
    container_name: ${ES_LOCAL_CONTAINER_NAME}
    volumes:
      - dev-elasticsearch:/usr/share/elasticsearch/data
    ports:
      - 127.0.0.1:${ES_LOCAL_PORT}:9200
    environment:
      - discovery.type=single-node
      - ELASTIC_PASSWORD=${ES_LOCAL_PASSWORD}
      - xpack.security.enabled=true
      - xpack.security.http.ssl.enabled=false
      - xpack.license.self_generated.type=${ES_LOCAL_LICENSE}
      - xpack.ml.use_auto_machine_memory_percent=true
      - ES_JAVA_OPTS=${ES_LOCAL_JAVA_OPTS}
      - cluster.routing.allocation.disk.watermark.low=${ES_LOCAL_DISK_SPACE_REQUIRED}
      - cluster.routing.allocation.disk.watermark.high=${ES_LOCAL_DISK_SPACE_REQUIRED}
      - cluster.routing.allocation.disk.watermark.flood_stage=${ES_LOCAL_DISK_SPACE_REQUIRED}
    ulimits:
      memlock:
        soft: -1
        hard: -1
    healthcheck:
      test:
        [
          "CMD-SHELL",
          "curl --output /dev/null --silent --head --fail -u elastic:${ES_LOCAL_PASSWORD} http://elasticsearch:9200",
        ]
      interval: 10s
      timeout: 10s
      retries: 30

  # FSCrawler
  fscrawler:
    image: dadoonet/fscrawler:${FSCRAWLER_VERSION}
    container_name: fscrawler
    restart: always
    environment:
      - FS_JAVA_OPTS=${FS_JAVA_OPTS}
      - FSCRAWLER_ELASTICSEARCH_URLS=http://${ES_LOCAL_CONTAINER_NAME}:9200
      - FSCRAWLER_ELASTICSEARCH_USERNAME=elastic
      - FSCRAWLER_ELASTICSEARCH_PASSWORD=${ES_LOCAL_PASSWORD}
      - FSCRAWLER_REST_URL=http://fscrawler:${FSCRAWLER_PORT}
    volumes:
      - ${PWD}/docs:/tmp/es:ro
    depends_on:
      elasticsearch:
        condition: service_healthy
    ports:
      - ${FSCRAWLER_PORT}:8080
    command: --rest

volumes:
  dev-elasticsearch:

Then I put your PDF documents within a `docs` dir and ran:

docker compose up

And once they have been indexed, I did:

curl localhost:9200/fscrawler/_search -u elastic:changeme | jq

Note that I will push a new Docker image soon, as I think there might be an issue with the alias name for the index. With this new image the docs index will be `fscrawler_docs`, and an alias named `fscrawler` will point to it.
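
With that new image in place, you should be able to confirm which index the `fscrawler` alias points to with something like this (a sketch, using the credentials from the `.env` above):

curl -s -u elastic:changeme localhost:9200/_alias/fscrawler | jq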

SITREP 1 - 21 Jan ‘26

My FSCrawler Docker Compose service running on EC2 is still failing to index the PDFs, even when they’re stored in the EC2 instance’s local file system.

My FSCrawler service running on my local machine was able to index all PDFs I threw at it.

I noticed you added Java heap size flags and disabled SVE (Arm’s Scalable Vector Extension) via a JVM flag on your single Elasticsearch node. You also added `cluster.routing.allocation.disk.watermark.*` settings, yes? Could this timeout error be due to an issue in the configuration of my Elasticsearch cluster?
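
A quick way to check what the cluster actually picked up for those watermark settings (a sketch; the password is a placeholder for my own):

curl -s -u elastic:<password> "localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true" | jq . | grep watermark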

SITREP 2 - 22 Jan ‘26

It seems that something is indeed misconfigured or resource-starved on the old EC2 instance on which I was trying to build this solution.

I’ve spun up a fresh EC2 instance of the same size running Ubuntu 24 and installed only what was necessary to run my Docker Compose services.

After running into some issues with the newer configuration (my Elasticsearch nodes did not like the -XX:UseSVE=0 option [likely because the container wasn’t running on an ARM machine], and the Java heap settings caused my nodes to die with exit code 78 [no log output except garbage collection, otherwise I’d share it]), the FSCrawler service operated flawlessly, even over NFS and even with larger PDFs. See the log comparison below:
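
For reference, the line from the earlier `.env` that I ended up dropping (a sketch, shown commented out; -XX:UseSVE=0 only applies on ARM, and the explicit heap sizes were what triggered the exit code 78):

#ES_LOCAL_JAVA_OPTS="-XX:UseSVE=0 -Xms128m -Xmx2g"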

I did notice that FSCrawler 2.10-SNAPSHOT now creates two actual Elasticsearch indices plus one alias. Using the alias for data views worked as intended.

The only thing I can imagine was going on with my old EC2 instance is resource starvation (likely RAM, but maybe compute). If that was the case, it’s odd that the TesseractOCRParser didn’t throw a more descriptive exception.

Thanks for all your support and guidance, @dadoonet!