Fscrawler not indexing all files


I am having trouble getting fscrawler to index all my files. For example, I am trying to index 7,701 files, but only 2,627 were indexed after a full day, and the count has not moved since.

I took some of the files that were not indexed out of the 7,701, copied them into another folder, and ran fscrawler on that folder; it indexed them without any problem.

So, I'm confused as to why fscrawler was not able to index the files in the original folder.

Other things to consider:
I am using Tesseract-OCR, so maybe it is just really slow? But I wouldn't expect it to stall at 2,627 files for over 8 hours now.

Also, because I saw it had stopped at 2,627 files, I "restarted" it: I stopped fscrawler and ran it again with the --trace and --restart options.

(I also freed up disk space. When I first "restarted" fscrawler, I ran into a read-only index error, so I used Kibana to set "read_only_allow_delete": "false". It was able to run with the --restart option after that. I suspect the block was applied because there wasn't much space left on the drive.)
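For anyone hitting the same error: Elasticsearch applies this block automatically when the disk flood-stage watermark is exceeded, and it can be cleared from Kibana Dev Tools roughly like this (the index name here is a placeholder; setting the value to null resets it to the default):

```
PUT /my_index/_settings
{
  "index.blocks.read_only_allow_delete": null
}
```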

Thanks for your help!



Did it move forward?

It might be a date issue. For now, FSCrawler compares modification dates to see whether a file is newer than the last run. The --restart option simply ignores the dates and indexes everything again.

Yes. But if you use the auto OCR strategy it could be faster. See https://fscrawler.readthedocs.io/en/latest/user/ocr.html#ocr-pdf-strategy.
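For reference, that strategy goes in the job's _settings.yaml; a minimal sketch (other settings omitted), where "auto" means FSCrawler extracts embedded text first and only runs OCR when a PDF has little or no text:

```yaml
fs:
  ocr:
    # Only OCR PDFs that have no extractable text
    pdf_strategy: "auto"
```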

Hi! Thanks for your response.

The indexing did not move forward. Increasing the heap size did seem to help for one of the folders I was trying to index, though.
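In case it helps anyone else, I set the heap through the FS_JAVA_OPTS environment variable before launching fscrawler; a sketch (the job name and heap size here are examples, adjust to your setup):

```sh
# Give FSCrawler a 4 GB heap, then run the job with tracing
FS_JAVA_OPTS="-Xmx4g" bin/fscrawler my_job --trace
```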

Thanks for the auto mode tip. I think it will be helpful if I use Tesseract-OCR in the future with fscrawler.

For now, we decided not to use Tesseract-OCR, and the files were indexed fine.
