FSCrawler does not scan `/tmp/es`

ehsan_kabiri_33 · November 29, 2021, 10:04am

Using Debian 10, Elasticsearch 7, Java JDK 11 and FSCrawler, when I run the crawler it only indexes the files in the /tmp/es directory at the first launch after the initial setup of _settings.yaml.
At first initialization it looks fine, because it correctly indexes all the .pdf files in the url directory.

But after the first launch (which creates the indices in Elasticsearch), files added to the url directory are not seen or indexed by the crawler. Even stopping and restarting fscrawler does not index the new files, unless I run ./fscrawler resumes --restart, which does index the files recently added to the url directory.

This is _settings.yaml

---
name: "resumes"
fs:
  url: "/tmp/es"
  update_rate: "3m"
  excludes:
  - "*/~*"
  json_support: false
  filename_as_id: false
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false
  index_content: true
  attributes_support: false
  raw_metadata: false
  xml_support: false
  index_folders: true
  lang_detect: false
  continue_on_error: false
  ocr:
    language: "eng"
    enabled: true
    pdf_strategy: "ocr_and_text"
  follow_symlinks: false
elasticsearch:
  nodes:
  - url: "http://192.168.225.129:9200"
  bulk_size: 100
  flush_interval: "5s"
  byte_size: "10mb"
  ssl_verification: true

fscrawler.log:

03:47:22,328 INFO  [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [13.6mb/494mb=2.77%], RAM [178mb/1.9gb=9.03%], Swap [524.2mb/974.9mb=53.77%].
... Starting FS crawler
... FS crawler started in watch mode. It will run unless you stop it with CTRL+C.

//Many warnings about security ...

03:47:23,188 WARN  [o.e.c.RestClient] request [GET http://192.168.225.129:9200/] returned 1 warnings: [299 Elasticsearch-7.15.2-... "Elasticsearch built-in security features are not enabled. Without authentication, your cluster could be accessible to anyone. See https://www.elastic.co/guide/en/elasticsearch/reference/7.15/security-minimal-setup.html to enable security."]
...

03:47:23,605 INFO  [f.p.e.c.f.FsParserAbstract] FS crawler started for [resumes] for [/home/pdf] every [10s]

...

Is there any configuration I need to change?

dadoonet · November 30, 2021, 4:01pm

The current implementation of FSCrawler is not ideal.
It uses date comparison to check whether something changed.

Depending on the OS, if you move a file for example, the file does not appear as "new", so FSCrawler is unable to detect it.

The --restart option basically ignores the dates and reindexes everything.
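
To illustrate the point about dates, here is a minimal Python sketch of date-based change detection. It is not FSCrawler's actual code: the function names `changed_since` and `demo` and the file names are hypothetical, and the model is simplified to a single modification-time check. It shows why a file that arrives with an old timestamp is skipped by an incremental scan, and why ignoring the dates (roughly what --restart amounts to in this model) picks it up.

```python
import os
import tempfile
import time


def changed_since(directory, last_run):
    """Return the sorted names of files modified after last_run.

    last_run is a Unix timestamp, or None to ignore dates and
    return every file (the "--restart"-like case in this sketch).
    """
    found = []
    for entry in os.scandir(directory):
        if not entry.is_file():
            continue
        if last_run is None or entry.stat().st_mtime > last_run:
            found.append(entry.name)
    return sorted(found)


def demo():
    """Build a temporary directory with one fresh and one old-dated file."""
    with tempfile.TemporaryDirectory() as watched:
        last_run = time.time()

        # A file written after the last scan gets a newer modification time.
        new_path = os.path.join(watched, "new.pdf")
        with open(new_path, "wb") as f:
            f.write(b"new")
        os.utime(new_path, (last_run + 60, last_run + 60))

        # A file that was moved in (or copied with preserved timestamps)
        # keeps its old modification time; simulated here with os.utime.
        moved_path = os.path.join(watched, "moved.pdf")
        with open(moved_path, "wb") as f:
            f.write(b"old")
        os.utime(moved_path, (last_run - 86400, last_run - 86400))

        incremental = changed_since(watched, last_run)
        full = changed_since(watched, None)
        return incremental, full


if __name__ == "__main__":
    incremental, full = demo()
    print("incremental scan:", incremental)
    print("scan ignoring dates:", full)
```

In this simplified model, the incremental scan only reports `new.pdf`, while the scan that ignores dates reports both files. Under the same assumption, refreshing a file's modification time after placing it in the directory would make it qualify for the next incremental scan.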

There are multiple things I'd like to support in the future:

- Change the implementation: Use a WatchService implementation · Issue #399 · dadoonet/fscrawler · GitHub
- Manually trigger indexing of a file using the REST interface: Read from any FS Provider using the REST Service · Issue #1247 · dadoonet/fscrawler · GitHub

system · December 28, 2021, 4:01pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
