Elasticsearch - FSCrawler missing documents in index


(Suresh Nataraj) #1

Hi All,

I am using FSCrawler to index a file system into ES. After a successful run of the crawler, the status shows:

```
{
  "name" : "test_crawler_1",
  "lastrun" : "2017-09-14T00:24:03.808",
  "indexed" : 132560,
  "deleted" : 0
}
```

But in ES, I can see only 40359 documents. Could you please tell me what the issue might be?

I am using ES 5.5.2 and FSCrawler 2.4.
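For reference, a quick way to compare the crawler stats with what actually landed in ES is a `_count` request (this assumes the index name matches the job name, `test_crawler_1`):

```
GET /test_crawler_1/_count
```

If the count stays well below the crawler's "indexed" number across runs, the gap is likely caused by indexing errors rather than by the stats themselves.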


(David Pilato) #2

Please don't ping people like this.

Please format your code using the </> icon as explained in this guide. It will make your post more readable.

Or use markdown style like:

```
CODE
```

I need to look at the FSCrawler code for the stats (I don't recall exactly), but some ideas:

  • Maybe you restarted FSCrawler?
  • Maybe there is a bug in FSCrawler that is indexing the same files again and again?
  • Maybe FSCrawler also counts the folders that have been indexed?

In any case, I would not take the FSCrawler stats too seriously. It's more of an internal number (useful for debugging purposes).

But if you can reproduce what is happening with a small scenario, I'd appreciate it if you opened an issue in the FSCrawler project with all the reproduction steps.

Thanks!


(Suresh Nataraj) #3

Thanks for the reply!

  • This is the first run of FSCrawler.
  • I am not sure about the bug; I need to check it.
  • There are 14932 folders. The stats correctly print the number of files per folder, but adding up the folder and file counts does not match the total either.

I tried many times, but the same number of files is reported as indexed each time.


(David Pilato) #4

Can you open an issue? I'll add a unit test and a fix if needed (which is probably the case :slight_smile:).


(Suresh Nataraj) #5

I dug a little deeper and found the following:

File count by extension:

• 92205 htm
• 25 html
• 39936 pdf
• 394 PDF

I tried running the crawler including only .htm files. I can see only 6 files were indexed; the others are missing. The issue with the stats is that they claim all files were indexed:

```
{
  "name" : "test_crawler_2",
  "lastrun" : "2017-09-14T04:39:09.666",
  "indexed" : 92205,
  "deleted" : 0
}
```
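For reference, the .htm-only filter above would sit in the job's _settings.json roughly like this (a sketch; the url path is a placeholder and the include pattern syntax is worth double-checking against the FSCrawler docs):

```
{
  "name" : "test_crawler_2",
  "fs" : {
    "url" : "/path/to/files",
    "includes" : [ "*.htm" ]
  }
}
```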

I tried running again with the debug and trace flags, and found this error:

```
{
  "error": {
    "root_cause": [
      {
        "type": "mapper_parsing_exception",
        "reason": "failed to parse [meta.raw.UploadDate]"
      }
    ],
    "type": "mapper_parsing_exception",
    "reason": "failed to parse [meta.raw.UploadDate]",
    "caused_by": {
      "type": "illegal_argument_exception",
      "reason": "Invalid format: \"Wed Jan 28 12:00:00 UTC 2015\""
    }
  },
  "status": 400
}
```

What is the suggestion here? Should I change the meta.raw.UploadDate type to keyword/text?


(David Pilato) #6

Yes. Meta fields are not defined in the FSCrawler mapping. My guess is that dynamic mapping detected that field as a date but then failed to parse this other format.
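One option along those lines (a sketch, not something FSCrawler does out of the box): recreate the index with dynamic date detection disabled, so string meta fields are mapped as text/keyword instead of date. The type name `doc` here is an assumption about what FSCrawler uses:

```
PUT /test_crawler_2
{
  "mappings": {
    "doc": {
      "date_detection": false
    }
  }
}
```

This has to happen before FSCrawler creates the index (or after deleting it), since an existing field mapping cannot be changed in place.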

Could you open an issue in FSCrawler? Maybe I can come up with an idea...
You can also use an ingest pipeline with a date processor to transform this date before indexing, if needed.
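Such a date processor pipeline might look roughly like this (a sketch; the pipeline name and the exact format pattern are assumptions to check against your actual data):

```
PUT _ingest/pipeline/fix_upload_date
{
  "processors": [
    {
      "date": {
        "field": "meta.raw.UploadDate",
        "formats": ["EEE MMM dd HH:mm:ss zzz yyyy"],
        "target_field": "meta.raw.UploadDate"
      }
    }
  ]
}
```

You would then configure FSCrawler to send documents through this pipeline (if your version supports the elasticsearch pipeline setting) so the date is normalized before indexing.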

Thanks a lot for the analysis.


(Suresh Nataraj) #7

Thanks

I have created the below issue:

I guess you need to change the last-run stats to detect these errors and count only successfully indexed documents. Also, if FSCrawler logged these errors somewhere, that would be very useful.


(Roland Häder) #8

Any news here?


(system) #9

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.