FSCrawler path to remote dir

I meant: can you ssh to the machine and run that command from the SSH Client?

I did that, but I can't execute ls in the SSH Client:

user@host C:\Users\user>ls d:/TestData
'ls' is not recognized as an internal or external command,
operable program or batch file.

user@host C:\Users\user>
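
For reference, ls is not a Windows cmd command; the built-in equivalent is dir, so the same check from the SSH Client would presumably look something like:

user@host C:\Users\user>dir d:\TestData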

Oh, maybe to clarify: the remote server is also a Windows machine.

Yeah. I got that.
Sounds like I need to start a Windows VM and add a patch to the code.

Could you open an issue in https://github.com/dadoonet/fscrawler/issues?

Feature or bug? Sure I can :slight_smile:

It's a bug to me.

I opened it. If you want me to add extra details to the issue, I surely can. I mainly linked to this post here.

Can you estimate how much effort it will take to change this, or even when it will work?
Thank you so much :slight_smile:

I don't know. I first need to have a VM running with an ssh server on it. WIP.
But I'm on it. :wink:

I think there is nothing to fix, just better documentation to add...

Could you try with:

name: "dev_binary"
fs:
  url: "/D:/TestData"
server:
  hostname: "host"
  protocol: "ssh"
  username: "user"
  password: "pw"
  update_rate: "15m"
  excludes:
  - "*/~*"
  json_support: false
  filename_as_id: false
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false
  index_content: true
  attributes_support: false
  raw_metadata: false
  xml_support: false
  index_folders: true
  lang_detect: false
  continue_on_error: false
  ocr:
    language: "eng"
    enabled: true
    pdf_strategy: "ocr_and_text"
  follow_symlinks: false
elasticsearch:
  nodes:
  - url: "http://127.0.0.1:9200"
  bulk_size: 100
  flush_interval: "5s"
  byte_size: "10mb"

Thank you very much! Should I update the documentation and submit a pull request?

Working perfectly now.

But somehow, if I change the update_rate, it stays at 15 minutes after a restart?
Is my config in the correct format? It seems like update_rate is assigned to server.

I'm on it :wink:

update_rate is just the pause duration between two internal runs once FSCrawler has started.
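
For reference, in the FSCrawler documentation update_rate is an fs setting rather than a server setting, so its usual placement would look something like this (values here are just examples):

name: "dev_binary"
fs:
  url: "/D:/TestData"
  update_rate: "1m"
server:
  hostname: "host"
  protocol: "ssh"
  username: "user"
  password: "pw"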

Yes, that's what I thought too.
After setting it to 1m, should it start every 1m?

15:12:15,171 INFO  [f.p.e.c.f.FsParserAbstract] FS crawler started for [dev_binary] for [/D:/TestData] every [15m]

It will start a new run after it has finished the current run, with a delay of 1 minute.

Aaaaah, ok. Is there a way to set a schedule for when FSCrawler should look for new / edited files?
I thought that's the way to go :slight_smile:

You can put FSCrawler in a crontab I guess and run it with the option --loop 1. It will exit after one run.

See https://fscrawler.readthedocs.io/en/latest/admin/cli-options.html#loop
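
A minimal sketch of such a crontab entry, assuming FSCrawler is unpacked under a hypothetical /opt/fscrawler and the job is named dev_binary:

# Run one full crawl every 30 minutes; --loop 1 makes FSCrawler exit after a single run.
*/30 * * * * /opt/fscrawler/bin/fscrawler dev_binary --loop 1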

Hey @dadoonet, I'm experiencing another issue with the url.

If the url contains a subfolder, an exception is raised.

I got the following structure:

  • TestData
    | - TestDataSubFolder

    url: "/D:/TestData"

Exception:

11:00:40,889 WARN  [f.p.e.c.f.FsParserAbstract] Error while crawling /D:/TestData: String index out of range: -1
11:00:40,889 WARN  [f.p.e.c.f.FsParserAbstract] Full stacktrace
java.lang.StringIndexOutOfBoundsException: String index out of range: -1
        at java.lang.String.substring(Unknown Source) ~[?:1.8.0_202]
        at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.indexDirectory(FsParserAbstract.java:541) ~[fscrawler-core-2.7-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:289) ~[fscrawler-core-2.7-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.run(FsParserAbstract.java:149) [fscrawler-core-2.7-SNAPSHOT.jar:?]
        at java.lang.Thread.run(Unknown Source) [?:1.8.0_202]
11:00:40,905 DEBUG [f.p.e.c.f.FsParserAbstract] Fs crawler is going to sleep for 15m

It works if there isn't a subfolder. Quite strange.

Edit:
Somehow the same exception is now also suddenly being raised if there isn't a subfolder.

It doesn't work anymore...

I'm quite confused. After looking up the code and reproducing this line in particular:

String rootdir = path.substring(0, path.lastIndexOf(File.separator));

I always receive -1 from path.lastIndexOf, which means the separator couldn't be found.
So I wrote my string like this: \D:\TestData, and lastIndexOf then delivers the correct index.

Now I'm confused about how this worked before :open_mouth: or whether I changed something which is now throwing errors...
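
A minimal sketch of what seems to be happening, assuming the crawl runs on Windows, where File.separator is "\" (the class name and print statements below are just for illustration):

import java.io.File;

public class SeparatorRepro {
    public static void main(String[] args) {
        // The SSH crawl hands over a forward-slash path like the fs.url above.
        String path = "/D:/TestData";

        // On Windows, File.separator is "\", which never occurs in this path,
        // so lastIndexOf returns -1...
        int index = path.lastIndexOf(File.separator);
        System.out.println("lastIndexOf(File.separator) = " + index);

        // ...and substring(0, -1) then throws StringIndexOutOfBoundsException,
        // which matches the stack trace above.
        String rootdir = path.substring(0, index);
        System.out.println(rootdir);
    }
}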

Is that working now? If not, please open an issue.

I copied an old version of _status.json into the job directory, and it's working now if there are no subfolders.

The same exception gets raised if there are subfolders.
Any idea on that?

I'll open an issue.

EDIT: it again seems like a problem with the url. I just tested it locally on my Windows machine and it all worked fine with subdirectories.
