FSCrawler path to remote dir

I think there is nothing to fix here; we just need to add better documentation...

Could you try with:

name: "dev_binary"
fs:
  url: "/D:/TestData"
server:
  hostname: "host"
  protocol: "ssh"
  username: "user"
  password: "pw"
  update_rate: "15m"
  excludes:
  - "*/~*"
  json_support: false
  filename_as_id: false
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false
  index_content: true
  attributes_support: false
  raw_metadata: false
  xml_support: false
  index_folders: true
  lang_detect: false
  continue_on_error: false
  ocr:
    language: "eng"
    enabled: true
    pdf_strategy: "ocr_and_text"
  follow_symlinks: false
elasticsearch:
  nodes:
  - url: "http://127.0.0.1:9200"
  bulk_size: 100
  flush_interval: "5s"
  byte_size: "10mb"

Thank you very much! Should I update the documentation and submit a pull request?

Working perfectly now.


But somehow, if I change the update_rate, it stays at 15 minutes after a restart.
Is my config in the correct format? It looks like update_rate ends up assigned to server.
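
For comparison, the FSCrawler documentation puts update_rate under the fs section rather than under server, so I would have expected something like this (excerpt only, trimmed to the relevant keys):

name: "dev_binary"
fs:
  url: "/D:/TestData"
  update_rate: "1m"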

I'm on it :wink:

update_rate is just the pause duration between two internal runs once FSCrawler has started.

Yes, that's what I thought too.
After setting it to 1m, shouldn't it start every 1m? The log still shows 15m:

15:12:15,171 INFO  [f.p.e.c.f.FsParserAbstract] FS crawler started for [dev_binary] for [/D:/TestData] every [15m]

It will start a new run after it has finished the current run and with a delay of 1 minute.
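
To make the timing concrete, here is a minimal sketch of that behavior (my own illustration, not FSCrawler's actual code): the update_rate pause only begins once a run finishes, so runs are not aligned to a fixed wall-clock schedule.

import java.time.Duration;

public class CrawlLoopSketch {
    public static void main(String[] args) throws InterruptedException {
        Duration updateRate = Duration.ofMinutes(1); // fs.update_rate: "1m"
        while (true) {
            crawl();                                 // one full pass over the tree
            Thread.sleep(updateRate.toMillis());     // then pause for update_rate
        }
    }

    // Placeholder for a single crawl run; in FSCrawler this is the
    // directory walk that indexes new and updated files.
    private static void crawl() {
    }
}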


Aaaaah, ok. Is there a way to set a schedule for when FSCrawler should look for new/edited files?
I thought that's the way to go :slight_smile:

You can put FSCrawler in a crontab, I guess, and run it with the option --loop 1. It will exit after one run.

See https://fscrawler.readthedocs.io/en/latest/admin/cli-options.html#loop
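
For example, a crontab entry like this would run one pass every 30 minutes (the schedule, install path, and job name dev_binary are placeholders for your own setup):

# Run a single FSCrawler pass every 30 minutes, exiting after each run.
*/30 * * * * /opt/fscrawler/bin/fscrawler dev_binary --loop 1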


Hey @dadoonet, I'm experiencing another issue with the url.

If the url contains a subfolder, an exception is raised.

I have the following structure:

  • TestData
    | - TestDataSubFolder

    url: "/D:/TestData"

Exception:

11:00:40,889 WARN  [f.p.e.c.f.FsParserAbstract] Error while crawling /D:/TestData: String index out of range: -1
11:00:40,889 WARN  [f.p.e.c.f.FsParserAbstract] Full stacktrace
java.lang.StringIndexOutOfBoundsException: String index out of range: -1
        at java.lang.String.substring(Unknown Source) ~[?:1.8.0_202]
        at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.indexDirectory(FsParserAbstract.java:541) ~[fscrawler-core-2.7-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:289) ~[fscrawler-core-2.7-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.run(FsParserAbstract.java:149) [fscrawler-core-2.7-SNAPSHOT.jar:?]
        at java.lang.Thread.run(Unknown Source) [?:1.8.0_202]
11:00:40,905 DEBUG [f.p.e.c.f.FsParserAbstract] Fs crawler is going to sleep for 15m

It works if there isn't a subfolder. Quite strange.

Edit:
Somehow the same exception is now also suddenly being raised when there isn't a subfolder.

It doesn't work anymore...

I'm quite confused. But after looking at the code and reproducing this line in particular:

String rootdir = path.substring(0, path.lastIndexOf(File.separator));

I always receive -1 from path.lastIndexOf, which means the separator couldn't be found.
So I wrote my string like this: \D:\TestData, and lastIndexOf delivers the correct index.
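
Here is a minimal standalone reproduction of what I'm seeing (my own test program, assuming a Windows JVM where File.separator is "\"):

import java.io.File;

public class SeparatorRepro {
    public static void main(String[] args) {
        String path = "/D:/TestData";
        // On Windows, File.separator is "\", which never occurs in a
        // forward-slash path, so lastIndexOf returns -1.
        int idx = path.lastIndexOf(File.separator);
        System.out.println("separator index: " + idx);
        // substring(0, -1) throws StringIndexOutOfBoundsException,
        // matching the stack trace above.
        String rootdir = path.substring(0, idx);
        System.out.println(rootdir);
    }
}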

Now I'm confused about how this worked before :open_mouth:, or whether I changed something that is now throwing errors...

Is that working now? If not, please open an issue.

I copied an old version of _status.json into the job directory, and it works now if there is no subfolder.

The same exception gets raised if there are subfolders.
Any idea on that?

I'll open an issue.

EDIT: it again seems like a problem with the url. I just tested it locally on my Windows machine and it all worked fine with subdirectories.
