Elastic Open Crawler 0.3.0 - /var/lib/docker/overlay filling up

Hello,

I have been working with the Open Crawler for a bit, trying to tune it to replace the Enterprise Search crawler at some point.

While running crawls I began seeing the following errors in the logs, and the resulting documents were not being indexed, due to the container filesystem becoming full:

[primary] Bulk index failed after 4 attempts: 'Failed to index documents into Elasticsearch with an>
Jun 30 10:30:49 poc.contoso.com docker[695511]: [2025-06-30T14:30:49.740Z] [crawl:6862802ff1f17972cfa02bf4] [primary] Bulk index failed for unexpected reason: No space left on device - /home/app/output/

My intended approach for the crawls was to have the container start with the hosting VM and stay running. A scheduled crawl would then launch and begin when ready. Once the crawl is done, the container remains "up", waiting until the next crawl kicks off.

I wasn't aware these documents were being stored in the container. A good amount of the documents failing to index are failing because a pipeline can't find a field it needs to process. However, I think(?) the way the current Enterprise Search crawler operates, if the field isn't present for the pipeline to do anything with, it will still index the document rather than setting it aside on the crawl host.

I did notice a possibly related discussion here: Add tool to re-attempt failed bulk index payloads · Issue #66 · elastic/crawler

The question(s): Should I set up some kind of scripted purge of that container directory every so often? I don't think I can efficiently move those docs into the index myself, as they are broken into multiple dirs and sub-dirs.

Or is there a config option for the Open Crawler to prevent the local offload of failed documents?

It's possible that once I get the crawler tuning dialed in this will become a non-issue, but I'm curious whether there's a way I can better manage it in the meantime.

Ultimately, at least in my case, I think it would be fine for the crawler to still index a document even if the pipeline doesn't find the appropriate field to process.

Worst case, I can probably mount an attached disk/filesystem at /var/lib/docker/ and give it a good amount of space so I don't have to babysit it as much in the interim.

Hey there, thank you for your question! :slight_smile:

If you are running out of space in /home/app/output, your easiest course of action may be to map a local output/ directory to it as a volume. If you are using Open Crawler's docker-compose file, you can add a volume similar to how config and log volumes are mounted.

Add

- ./output:/home/app/output

to the volumes section of your docker-compose file.
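
For reference, here is a minimal sketch of what that could look like in docker-compose.yml. The service name, image tag, and the existing config/log mounts are placeholders I'm assuming from the defaults, so adjust them to match your file:

services:
  crawler:
    image: docker.elastic.co/integrations/crawler:latest  # placeholder image/tag - use what you already run
    volumes:
      - ./config:/home/app/config    # assumed existing mounts from the shipped compose file
      - ./logs:/home/app/logs
      - ./output:/home/app/output    # added: failed bulk payloads now land on the host

With this mapped, the dead documents end up in ./output on the VM, so the container's overlay filesystem no longer fills up and you can inspect or purge them from the host.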

Open Crawler currently doesn't support fully dropping docs that fail to index.

Hey, Matt

Thanks for the quick response!

A side question - since a decent amount of my document upsert failures are due to a pipeline not seeing a field it needs, is there a way to tell the crawler to index the document anyway?

Open Crawler currently doesn't support force-indexing in the case of pipeline failure.

However, this resource may be helpful in handling pipeline failures. You can specify additional processors or behavior for when a pipeline fails - it could help reduce or eliminate the number of dead docs in your output/ directory!
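
As a rough sketch of the kind of thing those docs describe (the pipeline and field names below are made up for illustration, not your actual pipeline): an ignore_missing flag on processors that support it, plus a pipeline-level on_failure handler, lets a document that is missing a field be indexed anyway, with the error recorded, instead of being rejected:

// Hypothetical example - pipeline and field names are illustrative only
PUT _ingest/pipeline/my-custom-pipeline
{
  "processors": [
    {
      "lowercase": {
        "field": "some_optional_field",
        "ignore_missing": true
      }
    }
  ],
  "on_failure": [
    {
      "set": {
        "field": "ingest_error",
        "value": "{{ _ingest.on_failure_message }}"
      }
    }
  ]
}

With a top-level on_failure block like this, a document whose processor fails still gets indexed (with the failure message stored) rather than being rejected by the pipeline.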

Thank you very much! Helpful as usual. :slight_smile:

That got me further along in setting up how the processors deal with failures on each.

Hi, Matt

Hope you don't mind me piggybacking an Open Crawler related question on this thread. I can make a separate thread if preferred.

I'm encountering an odd behavior with content extraction for PDF and Word .doc(x) files.

Enterprise Search will do its best to read the content of either of these file types and index a document containing the text extracted from the original. The odd issue is that the Open Crawler instead seems to append a very large hash-like blob of content to the document when it is indexed. That hashed data increases the size of each document, and with a decent number of crawled PDFs/.docs it increases the overall index size a good amount.

This may be another case of me not knowing of some configuration nuances with the new crawler.

I have a couple of example files below that show the behavior and are publicly accessible for testing. They aren't anything I am working with, just something to test against:

pipeline_enabled: true
  pipeline: "test-search-testindex@custom"

binary_content_extraction_enabled: true
binary_content_extraction_mime_types:
  - application/pdf
  - application/msword
  - application/vnd.ms-excel
  - application/vnd.ms-powerpoint
  - application/vnd.openxmlformats-officedocument.presentationml.presentation
  - application/vnd.openxmlformats-officedocument.presentationml.template
  - application/vnd.openxmlformats-officedocument.wordprocessingml.document
  - application/vnd.openxmlformats-officedocument.wordprocessingml.template
  - application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
domains:
  - url: https://gis.transportation.wv.gov
    seed_urls:
      - https://gis.transportation.wv.gov/ftp/TMA/HPMS2020data_21sub/Sample/Worksheets/screenshots.docx
      - https://gis.transportation.wv.gov/ftp/WVDOT_Transportation_GIS_Data/WVDOT_County_Code.pdf

Both of these are being indexed with a large amount of hashed data in the 'body' field. These types of documents would normally have a small amount of extracted text in the 'body' field when indexed.

@alongaks I see that you've set:

  pipeline: "test-search-testindex@custom"

and you were talking about Enterprise Search, which makes me guess that this pipeline was originally generated for you when you clicked a "copy and customize" button in the UI for an Enterprise Search Crawler index?

If I'm on the right track, you should also have a test-search-testindex pipeline, and that's the one you're intended to use. It will have a Pipeline processor in the middle of it that calls the test-search-testindex@custom pipeline (so you can keep modifying the one you've been modifying). But by not using the outer pipeline, you're missing all the cleanup stuff that we try to do for you. That's where your extra huge hash is coming from. See: elasticsearch/x-pack/plugin/core/template-resources/src/main/resources/entsearch/search_default_pipeline.json at main · elastic/elasticsearch · GitHub
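
In crawler-config terms, that would mean pointing at the outer pipeline instead - something like the following sketch, assuming the same pipeline settings you already have, with just the name changed:

pipeline_enabled: true
pipeline: "test-search-testindex"  # the outer pipeline, which in turn calls test-search-testindex@custom

The @custom pipeline still runs via that Pipeline processor, but the surrounding cleanup that strips out the huge hash-like content runs as well.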

For a full explanation, you may want to read: Ingest pipelines for search use cases | Elastic Docs

Hey, Sean

Appreciate the response! Bam! That was it!

Thank you. :right_facing_fist: :left_facing_fist:
