Elasticsearch - FSCrawler missing documents in index

Hi All,

I am using FSCrawler to index a file system into ES. After a successful run of the crawler, the status shows the following:

```
{
  "name" : "test_crawler_1",
  "lastrun" : "2017-09-14T00:24:03.808",
  "indexed" : 132560,
  "deleted" : 0
}
```

But in ES I can see only 40359 documents. Could you please tell me what the issue might be?

I am using ES 5.5.2 and FSCrawler 2.4.

Please don't ping people like this.

Please format your code using the </> icon as explained in this guide. It will make your post more readable.

Or use markdown style like:

```
CODE
```

I need to look at the FSCrawler code for how stats are computed (I don't recall exactly), but here are some ideas:

  • Maybe you restarted FSCrawler?
  • Maybe there is a bug in FSCrawler which indexes the same files again and again?
  • Maybe FSCrawler also counts how many folders have been indexed?

In any case, I would not take the FSCrawler stats too seriously. It's more of an internal number (mostly useful for debugging purposes).

But if you can reproduce what is happening with a small scenario, I'd appreciate it if you opened an issue in the FSCrawler project with all the reproduction steps.

Thanks!

Thanks for the reply!

  • This is the first run of FSCrawler.
  • I am not sure about the bug; I need to check it.
  • There are 14932 folders. The stats correctly print the number of files per folder, but adding up the folders and files still does not match.

I tried many times, but the same number of files gets indexed each time.

Can you open an issue? I'll add a unit test and a fix if needed (which is probably the case :slight_smile:).

I dug a little deeper and found out the following:

File count by extension:

• 92205 htm
• 25 html
• 39936 pdf
• 394 PDF
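Interestingly, these per-extension counts add up exactly to the "indexed" number reported in the first run, which suggests the stat counts files seen rather than documents successfully indexed. A quick arithmetic check:

```python
# File counts per extension, as reported above.
counts = {"htm": 92205, "html": 25, "pdf": 39936, "PDF": 394}

total = sum(counts.values())
print(total)  # 132560 -- the exact "indexed" value from test_crawler_1
```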

I tried running the crawler with only .htm files included. I can see that only 6 files were crawled; the others are missing. The issue with the stats is that they claim all of them were indexed:

```
{
  "name" : "test_crawler_2",
  "lastrun" : "2017-09-14T04:39:09.666",
  "indexed" : 92205,
  "deleted" : 0
}
```
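For reference, restricting a job to .htm files is typically done through the includes setting in the job's settings file (a sketch based on FSCrawler 2.4's JSON settings; the url path here is a placeholder, not the actual one used):

```
{
  "name" : "test_crawler_2",
  "fs" : {
    "url" : "/path/to/data",
    "includes" : [ "*.htm" ]
  }
}
```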

I tried running with the debug and trace flags and found this error:

```
{
  "error": {
    "root_cause": [
      {
        "type": "mapper_parsing_exception",
        "reason": "failed to parse [meta.raw.UploadDate]"
      }
    ],
    "type": "mapper_parsing_exception",
    "reason": "failed to parse [meta.raw.UploadDate]",
    "caused_by": {
      "type": "illegal_argument_exception",
      "reason": "Invalid format: \"Wed Jan 28 12:00:00 UTC 2015\""
    }
  },
  "status": 400
}
```
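The failing value is in a ctime-style format, which the index's date mapping cannot parse. As a sanity check outside of ES (a sketch, not part of FSCrawler), the value parses fine once the right format pattern is used, and can be normalized to ISO 8601, which the default date type does accept:

```python
from datetime import datetime

raw = "Wed Jan 28 12:00:00 UTC 2015"

# Parse the ctime-style value rejected by the mapping, then render it
# as ISO 8601.
parsed = datetime.strptime(raw, "%a %b %d %H:%M:%S %Z %Y")
print(parsed.isoformat())  # 2015-01-28T12:00:00
```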

What is the suggestion here? Should I change the meta.raw.UploadDate type to keyword/text?

Yes. Meta fields are not defined in the FSCrawler mapping. I'd guess that dynamic mapping detected that field as a date but then failed to parse this other format.
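For instance, you could declare the field explicitly before indexing so dynamic date detection never kicks in (a sketch; the index name is hypothetical, and the doc type name is assumed from FSCrawler's default):

```
PUT /my_index
{
  "mappings": {
    "doc": {
      "properties": {
        "meta": {
          "properties": {
            "raw": {
              "properties": {
                "UploadDate": { "type": "keyword" }
              }
            }
          }
        }
      }
    }
  }
}
```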

Could you open an issue in FSCrawler? Maybe I can try to come up with an idea...
You can also use an ingest pipeline with a date processor to transform this date before indexing, if needed.
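Such a pipeline could look like this (a sketch; the pipeline name is hypothetical, and the Joda-style format string is an assumption based on the failing value above):

```
PUT _ingest/pipeline/fix_uploaddate
{
  "processors": [
    {
      "date": {
        "field": "meta.raw.UploadDate",
        "formats": [ "EEE MMM dd HH:mm:ss zzz yyyy" ],
        "target_field": "meta.raw.UploadDate"
      }
    }
  ]
}
```

You would then need FSCrawler to send documents through that pipeline (there is a pipeline setting in the elasticsearch section of the job settings; check whether your version supports it).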

Thanks a lot for the analysis.

Thanks

I have created the issue:

I guess you need to change the last-run stats to detect these errors and count only successfully indexed documents. And if FSCrawler logged these errors somewhere, that would be very useful.


Any news here?

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.