FSCrawler does not rescan `/tmp/es`

Using Debian 10, Elasticsearch 7, Java JDK 11, and FSCrawler: when I run the crawler, it only indexes the files in the /tmp/es directory on the first launch, right after the initial setup of _settings.yaml.
On that first run everything seems fine, since it correctly indexes all the .pdf files under the url.

But after that first launch (which creates the indices in Elasticsearch), files added to the url directory are not seen by the crawler. Even stopping and restarting FSCrawler does not index the new files, unless I run ./fscrawler resumes --restart, which does index the recently added files.
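For reference, these are the two ways I start the job:

# normal start in watch mode: newly added files are never picked up
./fscrawler resumes

# full restart: ignores the previous state and re-indexes everything
./fscrawler resumes --restart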

This is _settings.yaml

---
name: "resumes"
fs:
  url: "/tmp/es"
  update_rate: "3m"
  excludes:
  - "*/~*"
  json_support: false
  filename_as_id: false
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false
  index_content: true
  attributes_support: false
  raw_metadata: false
  xml_support: false
  index_folders: true
  lang_detect: false
  continue_on_error: false
  ocr:
    language: "eng"
    enabled: true
    pdf_strategy: "ocr_and_text"
  follow_symlinks: false
elasticsearch:
  nodes:
  - url: "http://192.168.225.129:9200"
  bulk_size: 100
  flush_interval: "5s"
  byte_size: "10mb"
  ssl_verification: true
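
To check whether newly added files actually reach Elasticsearch, I count the documents in the index (assuming the default where the index is named after the job):

curl -s 'http://192.168.225.129:9200/resumes/_count?pretty'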

fscrawler.log:

03:47:22,328 INFO [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [13.6mb/494mb=2.77%], RAM [178mb/1.9gb=9.03%], Swap [524.2mb/974.9mb=53.77%].
... Starting FS crawler
... FS crawler started in watch mode. It will run unless you stop it with CTRL+C.

// many warnings about security ...

03:47:23,188 WARN [o.e.c.RestClient] request [GET http://192.168.225.129:9200/] returned 1 warnings: [299 Elasticsearch-7.15.2-... "Elasticsearch built-in security features are not enabled. Without authentication, your cluster could be accessible to anyone. See https://www.elastic.co/guide/en/elasticsearch/reference/7.15/security-minimal-setup.html to enable security."]
...

03:47:23,605 INFO [f.p.e.c.f.FsParserAbstract] FS crawler started for [resumes] for [/home/pdf] every [10s]

...

Is there any configuration I need to change?

The current implementation of FSCrawler is not ideal.
It uses date comparison to check whether something has changed.

Depending on the OS, if you move a file, for example, it does not appear "as new", so FSCrawler is unable to detect it.
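
For example, on Linux (a quick illustration, with arbitrary file names):

touch -d '2 days ago' old.pdf     # a file that already existed elsewhere
touch new.pdf                     # a freshly created file
mv old.pdf /tmp/es/               # mv preserves mtime: still looks 2 days old
cp new.pdf /tmp/es/               # cp (without -p) sets mtime to now: looks new
stat -c '%n %y' /tmp/es/*.pdf     # compare the modification times

A date-based check run after this would treat old.pdf as already seen, even though it only just arrived in the directory.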

The --restart option basically ignores those dates and reindexes everything.

There are multiple things I'd like to support in the future.

