FSCrawler path to remote dir

No. Sorry, my bad. I was still thinking the settings were JSON, but I had switched them to YAML.

Last question (maybe :slight_smile:). Could you run:

ls d:/TestData

No problem asking; I'm happy you're helping me with my problem :slight_smile:

I can't execute ls commands in PuTTY, since I'm on a Windows machine :confused:

I meant: can you SSH to the machine and run that command from the SSH client?

I did that, but I can't execute ls in the SSH client:

user@host C:\Users\user>ls d:/TestData
'ls' is not recognized as an internal or external command,
operable program or batch file.

user@host C:\Users\user>

Oh, maybe to clarify: the remote server is also a Windows machine.

Yeah, I got that.
Sounds like I need to start a Windows VM and add a patch to the code.

Could you open an issue in https://github.com/dadoonet/fscrawler/issues?

Feature or bug? Sure, I can :slight_smile:

It's a bug to me.

I opened it. If you want me to add extra details to the issue, I surely can. I mainly linked to this post.

Can you estimate how much effort it will be to change this, or even when it will work?
Thank you so much :slight_smile:

I don't know. I first need to have a VM running with an SSH server on it. WIP.
But I'm on it. :wink:

I think there is nothing to fix; it just needs better documentation...

Could you try with:

name: "dev_binary"
fs:
  url: "/D:/TestData"
server:
  hostname: "host"
  protocol: "ssh"
  username: "user"
  password: "pw"
  update_rate: "15m"
  excludes:
  - "*/~*"
  json_support: false
  filename_as_id: false
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false
  index_content: true
  attributes_support: false
  raw_metadata: false
  xml_support: false
  index_folders: true
  lang_detect: false
  continue_on_error: false
  ocr:
    language: "eng"
    enabled: true
    pdf_strategy: "ocr_and_text"
  follow_symlinks: false
elasticsearch:
  nodes:
  - url: "http://127.0.0.1:9200"
  bulk_size: 100
  flush_interval: "5s"
  byte_size: "10mb"

Thank you very much! Should I update the documentation and submit a pull request?

It's working perfectly now.

But somehow, if I change the update_rate, it stays at 15 minutes after a restart?
Is my config in the correct format? It seems like update_rate is assigned to server.

I'm on it :wink:

update_rate is just the pause duration between two internal runs once FSCrawler has started.

Yes, that's what I thought too.
After setting it to 1m, should it start every 1m?

15:12:15,171 INFO  [f.p.e.c.f.FsParserAbstract] FS crawler started for [dev_binary] for [/D:/TestData] every [15m]

It will start a new run after it has finished the current run, with a delay of 1 minute.
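
As an illustration only (this is not the actual FSCrawler code), here is a minimal Java sketch of that behaviour: update_rate is a pause between two runs, so runs do not start on a fixed clock.

import java.time.Duration;

public class CrawlLoopSketch {

    // corresponds to the update_rate setting, e.g. "1m"
    static final Duration UPDATE_RATE = Duration.ofMinutes(1);

    public static void main(String[] args) throws InterruptedException {
        while (true) {
            crawlOnce();                          // a run may itself take a long time
            Thread.sleep(UPDATE_RATE.toMillis()); // then pause for update_rate before the next run
        }
    }

    static void crawlOnce() {
        // walk the directory tree and index new / changed files
    }
}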

Aaaaah, OK. Is there a way to set a schedule for when FSCrawler should look for new / edited files?
I thought that's the way to go :slight_smile:

You can put FSCrawler in a crontab, I guess, and run it with the option --loop 1. It will exit after one run.

See https://fscrawler.readthedocs.io/en/latest/admin/cli-options.html#loop
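
For example, a crontab entry along these lines (the install path and the nightly schedule are assumptions; only the job name and --loop 1 come from this thread):

# run a single FSCrawler pass every night at 01:00
0 1 * * * /opt/fscrawler/bin/fscrawler dev_binary --loop 1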

Hey @dadoonet, I'm experiencing another issue with the url.

If the url contains a subfolder, an exception is raised.

I got the following structure:

• TestData
  | - TestDataSubFolder

url: "/D:/TestData"

Exception:

11:00:40,889 WARN  [f.p.e.c.f.FsParserAbstract] Error while crawling /D:/TestData: String index out of range: -1
11:00:40,889 WARN  [f.p.e.c.f.FsParserAbstract] Full stacktrace
java.lang.StringIndexOutOfBoundsException: String index out of range: -1
        at java.lang.String.substring(Unknown Source) ~[?:1.8.0_202]
        at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.indexDirectory(FsParserAbstract.java:541) ~[fscrawler-core-2.7-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:289) ~[fscrawler-core-2.7-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.run(FsParserAbstract.java:149) [fscrawler-core-2.7-SNAPSHOT.jar:?]
        at java.lang.Thread.run(Unknown Source) [?:1.8.0_202]
11:00:40,905 DEBUG [f.p.e.c.f.FsParserAbstract] Fs crawler is going to sleep for 15m

It works if there isn't a subfolder. Quite strange.

Edit:
Somehow, the same exception is now also suddenly being raised even when there isn't a subfolder.

It doesn't work anymore...

I'm quite confused. After looking up the code and reproducing this line in particular:

String rootdir = path.substring(0, path.lastIndexOf(File.separator));

I always receive -1 from path.lastIndexOf, which means the separator couldn't be found.
So I wrote my string like this: \D:\TestData, and lastIndexOf delivers the correct index.

Now I'm confused about how this worked before :open_mouth: or whether I changed something that is now throwing errors...
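
For illustration, a minimal Java sketch of what that line does with this path (the Windows separator is hardcoded here so the snippet behaves the same on any platform):

public class SeparatorSketch {
    public static void main(String[] args) {
        String path = "/D:/TestData";    // the path as configured in fs.url
        String sep = "\\";               // what File.separator is on Windows

        int idx = path.lastIndexOf(sep); // -1: there is no backslash in the path
        System.out.println(idx);
        // path.substring(0, idx) would throw StringIndexOutOfBoundsException: -1

        String winPath = "\\D:\\TestData";            // same path written with backslashes
        System.out.println(winPath.lastIndexOf(sep)); // 3, so the substring call works
    }
}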

Is that working now? If not, please open an issue.