FSCrawler is not indexing consistently


#1

I have configured FSCrawler to monitor a folder containing 13 items (folders and files of types txt, pdf, doc, xls and ppt), but not all files (and their content) are being indexed. Only 8 are indexed. Is there an error in the configuration?
Elasticsearch & FSCrawler versions

    {
      "name" : "d6uoHrq",
      "cluster_name" : "elasticsearch",
      "cluster_uuid" : "5X293rOKTo-0tqWdEPESGQ",
      "version" : {
        "number" : "6.6.2",
        "build_flavor" : "default",
        "build_type" : "deb",
        "build_hash" : "3bd3e59",
        "build_date" : "2019-03-06T15:16:26.864148Z",
        "build_snapshot" : false,
        "lucene_version" : "7.6.0",
        "minimum_wire_compatibility_version" : "5.6.0",
        "minimum_index_compatibility_version" : "5.0.0"
      },
      "tagline" : "You Know, for Search"
    }

    "version" : "2.7-SNAPSHOT",
    "ok" : true,
    "elasticsearch" : "6.6.2"

Configuration: Changed from the default

    name: "test"
    fs:
      url: "/home/test"
      update_rate: "2m"

Contents of the directory

    /home/test# ls -lrRt
    .:
    total 248
    -rw-r--r-- 1 root root 15581 Mar 10 09:27 testdoc.docx
    -rw-r--r-- 1 root root 15581 Mar 10 09:27 burnsfirstaid.docx
    -rw-r--r-- 1 root root   226 Mar 26 06:33 rhyme
    -rw-r--r-- 1 root root 54315 Mar 27 04:24 hickory.pdf
    -rw-r--r-- 1 root root   108 Mar 28 04:18 twinkle.txt
    -rw-r--r-- 1 root root  8617 Mar 28 06:10 weeks.xlsx
    -rw-r--r-- 1 root root 33135 Mar 28 06:22 Testppt.pptx
    -rw-r--r-- 1 root root  9373 Mar 28 06:27 months.xlsx
    -rw-r--r-- 1 root root 28642 Mar 28 06:42 twinkle.pdf
    -rw-r--r-- 1 root root 34433 Mar 28 06:43 Testppt2.pptx
    -rw-r--r-- 1 root root 19331 Mar 28 06:44 computerpractise.docx
    drwxr-xr-x 2 root root  4096 Mar 28 07:05 test2
    drwxr-xr-x 3 root root  4096 Mar 28 07:12 test1

    ./test2:
    total 36
    -rw-r--r-- 1 root root 34433 Mar 28 06:43 Testppt2.pptx

    ./test1:
    total 24
    -rw-r--r-- 1 root root 19331 Mar 28 06:44 computerpractise.docx
    drwxr-xr-x 2 root root  4096 Mar 28 07:12 test11

    ./test1/test11:
    total 0

Indices in elasticsearch

health status index                      uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   test                       D0LQH7nZRLuGBzCFiLKQUg   1   1          8            0    198.5kb        198.5kb
yellow open   test_folder                2aZtjtKETkG-JR39myuhQw   5   1          4            0     16.8kb         16.8kb

Log file
fscrawler log file


(David Pilato) #2

I don't see FSCrawler logs. Could you share them on gist.github.com?
Anything strange in logs?


#3

Sorry, I forgot to add the link to the log file. I have now updated the link in the original post. No, there are no warnings or errors in the log file.
I deleted this job and its index, then created a new job (and index), and I see similar behavior.
I can upload all the files using the REST API's upload option, but FSCrawler on its own ignores (does not index) some files.


(David Pilato) #4

Could you run it again with the --restart option?


#5

I ran FSCrawler with the --restart option, and all the files that were present at startup are indexed. If I add files after startup, they are not indexed, even though FSCrawler wakes up after each update interval.


(David Pilato) #6

The current implementation detects new files by their dates.
Whether that works depends on the OS.
On some operating systems, moving a file into the scanned directory does not change any of the file's dates, so FSCrawler won't pick it up.
You need to "touch" the file.
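
To illustrate the date-based detection described above, here is a minimal sketch in Python. It is not FSCrawler's actual code: `is_new_since` and `demo` are hypothetical names, and the check assumes a crawler that simply compares a file's modification time with the time of its last scan. The sketch shows that a file moved into a directory keeps its old modification time, and that "touching" it makes it look new.

```python
import os
import shutil
import tempfile
import time
from pathlib import Path


def is_new_since(path: Path, last_scan: float) -> bool:
    """Hypothetical, simplified date check: a file counts as new
    only if its modification time is later than the last scan."""
    return path.stat().st_mtime > last_scan


def demo() -> tuple:
    """Return (detected_after_move, detected_after_touch)."""
    with tempfile.TemporaryDirectory() as tmp:
        staging = Path(tmp) / "staging"
        watched = Path(tmp) / "watched"
        staging.mkdir()
        watched.mkdir()

        # Create a file elsewhere and backdate it by one hour.
        src = staging / "twinkle.txt"
        src.write_text("Twinkle, twinkle, little star")
        one_hour_ago = time.time() - 3600
        os.utime(src, (one_hour_ago, one_hour_ago))

        # Pretend the crawler last scanned the directory a minute ago.
        last_scan = time.time() - 60

        # Moving the file keeps its old modification time.
        dst = watched / "twinkle.txt"
        shutil.move(str(src), str(dst))
        after_move = is_new_since(dst, last_scan)

        # Touching the file sets its modification time to now.
        dst.touch()
        after_touch = is_new_since(dst, last_scan)

        return after_move, after_touch


if __name__ == "__main__":
    print(demo())
```

On the command line, the equivalent of `dst.touch()` is running `touch` on the file after moving it into the scanned directory.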


#7

OK, I understand now. Thanks, David, for the timely response.

