Hello,
I have been working with the Open Crawler for a bit, trying to tune it to replace the Enterprise Search crawler at some point.
While running crawls I began seeing the following errors in the logs, with the resulting documents not being indexed due to the container filesystem becoming full:
```
[primary] Bulk index failed after 4 attempts: 'Failed to index documents into Elasticsearch with an>
Jun 30 10:30:49 poc.contoso.com docker[695511]: [2025-06-30T14:30:49.740Z] [crawl:6862802ff1f17972cfa02bf4] [primary] Bulk index failed for unexpected reason: No space left on device - /home/app/output/
```
My intended setup is to have the container start with the hosting VM and stay running. A scheduled crawl launches when its time comes, and once the crawl is done, the container remains "up", waiting until the next crawl kicks off.
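To be concrete, the scheduling itself is just the cron pattern in the crawl config, if I'm reading the Open Crawler docs right (the file name and pattern below are from my setup, so treat them as placeholders):

```yaml
# config/my-crawler.yml (name is a placeholder)
schedule:
  pattern: "0 2 * * *"   # kick off a crawl every day at 02:00
```

with the long-running container invoking something like `bin/crawler schedule config/my-crawler.yml`.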
I wasn't aware these documents were being stored in the container. A good number of the documents that fail to index do so because an ingest pipeline can't find a field it needs to process. However, the way the current Enterprise Search crawler operates (I think?), a document still gets indexed even when the field the pipeline needs isn't present, rather than being set aside on the crawl host.
I did notice a possibly related discussion here: Add tool to re-attempt failed bulk index payloads · Issue #66 · elastic/crawler
My question(s): should I set up some kind of scripted purge of that container directory every so often (a sketch of what I mean is below)? I don't think I can efficiently move those docs into the index myself, as they're spread across multiple directories and sub-directories.
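Something like this is what I have in mind, assuming the failed payloads land under /home/app/output/ (the path from the log above); the container name `crawler` and the 7-day retention are placeholders:

```sh
# Host crontab entry: every day at 03:00, delete failed bulk payload
# files older than 7 days from inside the running crawler container.
0 3 * * * docker exec crawler find /home/app/output/ -type f -mtime +7 -delete
```

That keeps the filesystem in check, but it also throws the failed documents away for good, which is part of why I'd rather avoid writing them locally in the first place.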
Or is there a config option for Open Crawler that prevents this local offload of failed documents?
It's possible this becomes a non-issue once I get the crawler tuning dialed in, but I'm curious whether there's a way to better manage it in the meantime.
Ultimately, at least in my case, I think it would be fine for the crawler to still index a document even if the pipeline doesn't find the appropriate field to process.
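On that note, I believe the index-despite-a-missing-field behavior can be approximated on the pipeline side, since many processors accept `ignore_missing` (and any processor accepts `ignore_failure`). A minimal sketch, assuming a pipeline named `my-crawl-pipeline` with a `lowercase` processor standing in for whatever the real pipeline does:

```
PUT _ingest/pipeline/my-crawl-pipeline
{
  "processors": [
    {
      "lowercase": {
        "field": "body_content",
        "ignore_missing": true
      }
    }
  ]
}
```

With that in place, a document without `body_content` should get indexed untouched instead of failing the bulk request.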
Worst case, I can probably mount an attached disk/filesystem at /var/lib/docker/ and give it a good amount of space so I don't have to babysit it as much in the interim.
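i.e., something like this, assuming a fresh disk at /dev/sdb1 (device and filesystem type are placeholders, and any existing Docker data would need to be copied over before remounting):

```sh
# Stop Docker, format the new disk, and mount it at Docker's data root.
sudo systemctl stop docker
sudo mkfs.xfs /dev/sdb1
echo '/dev/sdb1 /var/lib/docker xfs defaults 0 0' | sudo tee -a /etc/fstab
sudo mount /var/lib/docker
sudo systemctl start docker
```

Though if the failed payloads are the only thing growing, mounting a volume over just /home/app/output/ in the container may be the tidier option.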