Elasticsearch - FSCrawler missing documents in index


(Suresh Nataraj) #1

Hi All,

I am using FSCrawler to index a file system into ES. After a successful run of the crawler, the status shows:

```
{
  "name" : "test_crawler_1",
  "lastrun" : "2017-09-14T00:24:03.808",
  "indexed" : 132560,
  "deleted" : 0
}
```

But in ES, I can see only 40359 documents. Could you please tell me what the issue might be?

I am using ES 5.5.2 and FSCrawler 2.4.
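For reference, a quick way to compare the crawler stats with what actually landed in ES is a `_count` request (this assumes the index name matches the job name, `test_crawler_1`):

```
GET /test_crawler_1/_count
```

If the count stays well below the crawler's "indexed" number across runs, the gap is likely caused by indexing errors rather than by the stats themselves.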


(David Pilato) #2

Please don't ping people like this.

Please format your code using the </> icon as explained in this guide. It will make your post more readable.

Or use markdown style like:

```
CODE
```

I need to look at the FSCrawler code for the stats (I don't recall exactly), but some ideas:

  • Maybe you restarted FSCrawler?
  • Maybe there is a bug in FSCrawler that is indexing the same files again and again?
  • Maybe FSCrawler also counts the folders that have been indexed?

In any case, I would not take the FSCrawler stats too seriously. It's more of an internal number (useful for debugging purposes).

But if you can reproduce what is happening with a small scenario, I'd appreciate it if you opened an issue in the FSCrawler project with all the reproduction steps.

Thanks!


(Suresh Nataraj) #3

Thanks for the reply!

  • This is the first run of FSCrawler.
  • I am not sure about the bug; I need to check it.
  • There are 14932 folders. The stats correctly print the number of files per folder, but adding up the folder and file counts does not match the total either.

I tried many times, but the same number of files is reported as indexed each time.


(David Pilato) #4

Can you open an issue? I'll add a unit test and a fix if needed (which is probably the case :slight_smile:).


(Suresh Nataraj) #5

I dug a little deeper and found the following:

File count by extension:

• 92205 htm
• 25 html
• 39936 pdf
• 394 PDF

I tried running the crawler including only .htm files. I can see only 6 files were indexed; the others are missing. The issue with the stats is that they claim all files were indexed:

```
{
  "name" : "test_crawler_2",
  "lastrun" : "2017-09-14T04:39:09.666",
  "indexed" : 92205,
  "deleted" : 0
}
```
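For reference, the .htm-only filter above would sit in the job's _settings.json roughly like this (a sketch; the url path is a placeholder and the include pattern syntax is worth double-checking against the FSCrawler docs):

```
{
  "name" : "test_crawler_2",
  "fs" : {
    "url" : "/path/to/files",
    "includes" : [ "*.htm" ]
  }
}
```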

I tried running again with the debug and trace flags, and found this error:

```
{
  "error": {
    "root_cause": [
      {
        "type": "mapper_parsing_exception",
        "reason": "failed to parse [meta.raw.UploadDate]"
      }
    ],
    "type": "mapper_parsing_exception",
    "reason": "failed to parse [meta.raw.UploadDate]",
    "caused_by": {
      "type": "illegal_argument_exception",
      "reason": "Invalid format: \"Wed Jan 28 12:00:00 UTC 2015\""
    }
  },
  "status": 400
}
```

What is the suggestion here? Should I change the meta.raw.UploadDate type to keyword/text?


(David Pilato) #6

Yes. Meta fields are not defined in the FSCrawler mapping. My guess is that dynamic mapping detected that field as a date but then failed to parse this other format.
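One option along those lines (a sketch, not something FSCrawler does out of the box): recreate the index with dynamic date detection disabled, so string meta fields are mapped as text/keyword instead of date. The type name `doc` here is an assumption about what FSCrawler uses:

```
PUT /test_crawler_2
{
  "mappings": {
    "doc": {
      "date_detection": false
    }
  }
}
```

This has to happen before FSCrawler creates the index (or after deleting it), since an existing field mapping cannot be changed in place.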

Could you open an issue in FSCrawler? Maybe I can come up with an idea...
You can also use an ingest pipeline with a date processor to transform this date before indexing, if needed.
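Such a date processor pipeline might look roughly like this (a sketch; the pipeline name and the exact format pattern are assumptions to check against your actual data):

```
PUT _ingest/pipeline/fix_upload_date
{
  "processors": [
    {
      "date": {
        "field": "meta.raw.UploadDate",
        "formats": ["EEE MMM dd HH:mm:ss zzz yyyy"],
        "target_field": "meta.raw.UploadDate"
      }
    }
  ]
}
```

You would then configure FSCrawler to send documents through this pipeline (if your version supports the elasticsearch pipeline setting) so the date is normalized before indexing.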

Thanks a lot for the analysis.


(Suresh Nataraj) #7

Thanks

I have created the below issue:

I guess you need to change the last-run stats to detect these errors and count only successfully indexed documents. Also, if FSCrawler logged these errors somewhere, that would be very useful.


(Roland Häder) #8

Any news here?


(system) #9

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.