Please format your code using the </> icon as explained in this guide. It will make your post more readable.
Or use markdown style like:
```
CODE
```
I need to look at the FSCrawler code that handles stats (I don't recall exactly how it works), but here are some ideas:
Maybe you restarted FSCrawler?
Maybe there is a bug in FSCrawler that indexes the same files again and again?
Maybe FSCrawler also counts the folders it has indexed?
In any case, I would not take the FSCrawler stats too seriously. They are more of an internal number (which could be used for debugging purposes).
But if you can reproduce what is happening with a small scenario, I'd appreciate it if you opened an issue in the FSCrawler project with all the steps to recreate it.
I dug a little deeper and found the following:
File count by extension:
• 92205 htm
• 25 html
• 39936 pdf
• 394 PDF
I tried running the crawler so that it only includes .htm files. I can see that only 6 files were crawled; the others are missing. The issue with the stats is that they report all files as indexed.
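For anyone who wants to reproduce per-extension counts like the ones above, here is a minimal sketch in Python. It is not part of FSCrawler; the function name `count_by_extension` and the root directory are just placeholders for illustration. Extensions are counted case-sensitively, which is why `pdf` and `PDF` show up as separate entries.

```python
from collections import Counter
from pathlib import Path


def count_by_extension(root):
    """Count regular files under `root`, grouped by extension (case-sensitive)."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            # Path.suffix includes the leading dot; strip it for display.
            ext = path.suffix[1:] if path.suffix else "(none)"
            counts[ext] += 1
    return counts


if __name__ == "__main__":
    # "." is a placeholder; point it at the directory FSCrawler is crawling.
    for ext, n in count_by_extension(".").most_common():
        print(f"{n} {ext}")
```

Comparing these numbers with the document count in the Elasticsearch index is probably a more reliable check than the FSCrawler stats.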
Yes. Meta fields are not defined in the FSCrawler mapping. My guess is that the automatic mapping detected that field as a date and then failed to parse this other format.
Could you open an issue in FSCrawler? Maybe I can come up with an idea...
You can also use an ingest pipeline with a date processor to transform this date, if needed, before indexing it.
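As a rough sketch of what such a pipeline could look like, the Python below only builds the JSON body for Elasticsearch's `PUT _ingest/pipeline/<name>` API. The field name `meta.date` and the list of formats are assumptions, since the thread does not say which field or format is failing; replace them with the real ones. The helper `build_date_pipeline` is hypothetical too.

```python
import json


def build_date_pipeline(field, formats):
    """Build an ingest pipeline body with a single date processor.

    `field` and `formats` are placeholders: use the meta field that fails
    to parse and the date formats that actually occur in your documents.
    """
    return {
        "description": "Normalize a date field before indexing",
        "processors": [
            {
                "date": {
                    "field": field,
                    # Write the parsed date back to the same field; by
                    # default the date processor targets "@timestamp".
                    "target_field": field,
                    "formats": formats,
                }
            }
        ],
    }


if __name__ == "__main__":
    # Hypothetical field name and formats, for illustration only.
    body = build_date_pipeline("meta.date", ["ISO8601", "yyyy-MM-dd HH:mm:ss"])
    print(json.dumps(body, indent=2))
```

The printed JSON can be sent to Elasticsearch to create the pipeline. If I remember correctly, FSCrawler can then be pointed at it with the `pipeline` option in the `elasticsearch` part of the job settings, but please check the FSCrawler documentation for your version.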
I guess the last-run stats need to change so that they detect these errors and count only the documents that were actually indexed. It would also be very useful if FSCrawler logged these errors somewhere.