If FSCrawler stops mid-scan, will it restart from scratch?

As a newcomer to Elasticsearch, I want to index a large filesystem (100TB) with FSCrawler.

I’ve heard that if the initial scan is interrupted, FSCrawler might re-scan everything from the beginning.

What’s the best strategy to handle this for such a massive dataset?

Are there checkpointing/resume features, configuration tweaks, or alternative workflows to avoid redundant work?

Welcome!

This is correct. Until a change is made, the only workaround I can see is to start multiple FSCrawler instances, one per directory in the root directory, as sketched below.
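For example, a minimal per-directory job could look like this (a sketch only: the paths, job names, and index name below are hypothetical, and the exact settings keys can differ between FSCrawler versions):

```yaml
# ~/.fscrawler/job_dir1/_settings.yaml  (hypothetical job name and paths)
name: "job_dir1"
fs:
  # One subdirectory of the large root per FSCrawler instance
  url: "/mnt/bigshare/dir1"
  update_rate: "15m"
elasticsearch:
  nodes:
    - url: "https://127.0.0.1:9200"
  # All instances can target the same index so searches see one dataset
  index: "bigshare"
```

You would create one such job per top-level directory (`job_dir2` for `/mnt/bigshare/dir2`, and so on) and start each instance separately, e.g. `bin/fscrawler job_dir1`, so an interruption only loses the progress of that one subtree.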

Thank you for your support. Once I have scanned all the files using multiple FSCrawler instances, will running FSCrawler on the root directory then continue tracking changes without performing a full re-scan?

I think you will need to keep it running the same way it was run for the first scan.
I say "I think" because I'm not sure about it; I don't remember how the document ids are computed.

Maybe you could run it from the root and use a different includes setting for each instance, and then run it again from the root without the includes setting. That way, all ids should be consistent; see the sketch below.
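A rough sketch of that idea (hypothetical names and paths again, and the exact include pattern syntax may vary by version): every instance shares the same root `url` but restricts itself with `includes`, and the final, ongoing job drops `includes` entirely:

```yaml
# Instance 1: scans from the root but is restricted to one subtree
name: "bigshare_part1"
fs:
  url: "/mnt/bigshare"
  includes:
    - "/dir1/*"
elasticsearch:
  index: "bigshare"
---
# Instance 2: same root url, a different includes pattern
name: "bigshare_part2"
fs:
  url: "/mnt/bigshare"
  includes:
    - "/dir2/*"
elasticsearch:
  index: "bigshare"
---
# Later, a single ongoing job: same root url, no includes, same index
name: "bigshare"
fs:
  url: "/mnt/bigshare"
elasticsearch:
  index: "bigshare"
```

Because every job scans from the same root `url`, the ids derived from the file paths should line up, so the later root-level run should update the existing documents in place rather than create duplicates.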