TL;DR: In a full ES v9.2 stack running under Docker Compose, FSCrawler v2.10-SNAPSHOT returns only TesseractOCRParser timeout errors when trying to ingest any PDFs.
Full details
I’m currently building a local Elasticsearch 9.2 cluster via Docker Compose (it will eventually be deployed to Amazon ECS) as a custom document search engine. However, FSCrawler 2.10-SNAPSHOT keeps returning TesseractOCRParser timeout errors when trying to ingest PDFs.
I’ve got an entire stack running: three Elasticsearch nodes, a dedicated Kibana node, and a dedicated FSCrawler container. Everything is communicating correctly on the Docker Compose backplane (all the ES nodes know of each other & Kibana sees them all). FSCrawler can access the original files in its container’s bind mount. It correctly indexes text files’ contents. I’ve verified this by running queries for the text files’ contents on a data view based on FSCrawler’s index under Kibana’s Discover page. I get expected results.
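For reference, the FSCrawler service in my compose file looks roughly like this (image tag, job name, service names, and host paths are simplified placeholders, not my exact values):

```yaml
  fscrawler:
    image: dadoonet/fscrawler:2.10-SNAPSHOT
    volumes:
      # Job settings (_settings.yaml) live under ./config/<job_name>/
      - ./config:/root/.fscrawler
      # The documents to crawl, mounted read-only
      - ./documents:/tmp/es:ro
    depends_on:
      - es01
    command: fscrawler my_job
```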
However, any PDFs come up as (null) contents. I can see the CPU usage of the FSCrawler container spike — and remain high for several minutes — when it starts running the job on new PDF files. When the job ends and I look at the FSCrawler log, I see error (more accurately, WARN) messages, all of which read:
<REDACTED TIMESTAMP> WARN [f.p.e.c.f.t.TikaDocParser] Failed to extract [5000000] characters of text for <REDACTED FILEPATH>]: Unable to extract PDF content -> Unable to end a page -> TesseractOCRParser timeout
Note 1: I’ve already increased the indexed character limit to 5 million (see the settings snippet after these notes).
Note 2: None of my test PDFs are more than 2,500 non-whitespace characters.
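The relevant part of my job’s _settings.yaml is roughly this (the job name and Elasticsearch URL are placeholders; the indexed_chars value is the 5 million from Note 1):

```yaml
name: "my_job"
fs:
  url: "/tmp/es"
  # Raised from the default so larger documents are indexed in full
  indexed_chars: 5000000
elasticsearch:
  nodes:
    - url: "https://es01:9200"
```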
I’m unclear as to why this containerized Java binary would be timing out.
FWIW, I’d chosen FSCrawler because I’d seen it in recommendations online for ingesting various document formats from a file system and then automatically generating indices. It especially helps my use case that it handles PDFs off the shelf, has a built-in OCR feature, and can be deployed from official Docker Hub images. It also seems to be under active development with active support.
I would’ve added fscrawler as a tag to this post, but it doesn’t exist & I don’t yet have enough trust to create one.
Also, maybe add more memory to the FSCrawler process:
FS_JAVA_OPTS="-Xmx2048m -Xms2048m"
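With Docker Compose, that can go in the FSCrawler service’s environment, something like:

```yaml
  fscrawler:
    environment:
      # Give the FSCrawler JVM a 2 GB heap
      - FS_JAVA_OPTS=-Xmx2048m -Xms2048m
```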
Well, that makes it worse. The lower the limit, the better, but then of course the full document won’t be sent to Elasticsearch; only the first X characters will.
But do they have a lot of images? By default, FSCrawler tries to run OCR on any images it detects.
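To check whether OCR is the problem, you could disable it for the job and re-run, with something like this in _settings.yaml:

```yaml
fs:
  ocr:
    # Turn OCR off entirely to see if the timeout goes away
    enabled: false
    # or keep OCR for images but skip it for PDFs:
    # pdf_strategy: "no_ocr"
```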
Ideally, if you could share one of the documents you are trying to parse, that would help. Feel free to send me a direct message if you don’t want to share it here or on GitHub.
Are you running with a trial license or a commercial license?
Could you share the full logs?
Well, FSCrawler is not an official project provided by Elastic, so we did not add this tag.
@dadoonet, thank you for your rapid response! I’ve found FSCrawler to be matching my team’s needs pretty perfectly. Thank you also for continuing to develop it!
Let me try the change in official Docker images (I recall looking at both image tags’ CRC; thought they were the same) and also the Java memory options. I’ll report back shortly.
A slight tangent… I hadn’t checked; is the official FSCrawler image using Java 8 or a more recent release?
Fun note: I realized from your post that I’m still pushing an alias named `*-es7`, which should not exist. I’m going to remove it from future Docker pushes.
Let me try the change in official Docker images (I recall looking at both image tags’ CRC; thought they were the same)
Well, that’s true. It should not be there, that’s all.
Is the official FSCrawler image using Java 8
The image is using `eclipse-temurin:25-jdk`.
But I’m compiling using Java 11 as a target. I should definitely use at least Java 17.
I should have responded with this before: Thank you!
I’ve applied these changes, and the timeout warn/error is regrettably still occurring.
They have no images; they’re toy example Word docs that I saved as PDFs.
I’m running with the basic license.
Absolutely! The latest is below. Note that there are a couple of larger PDFs mentioned in the log; it was their initial failure that led me to make the toy examples above.