FSCrawler path to remote dir

dadoonet · March 11, 2020, 2:02pm

I don't think there is anything to fix, but the documentation could be better...

Could you try with:

name: "dev_binary"
fs:
  url: "/D:/TestData"
server:
  hostname: "host"
  protocol: "ssh"
  username: "user"
  password: "pw"
  update_rate: "15m"
  excludes:
  - "*/~*"
  json_support: false
  filename_as_id: false
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false
  index_content: true
  attributes_support: false
  raw_metadata: false
  xml_support: false
  index_folders: true
  lang_detect: false
  continue_on_error: false
  ocr:
    language: "eng"
    enabled: true
    pdf_strategy: "ocr_and_text"
  follow_symlinks: false
elasticsearch:
  nodes:
  - url: "http://127.0.0.1:9200"
  bulk_size: 100
  flush_interval: "5s"
  byte_size: "10mb"

LaaKii · March 11, 2020, 2:08pm

Thank you very much! Should I update the documentation and submit a pull request?

Working perfectly now.

LaaKii · March 11, 2020, 2:11pm

But somehow, if I change update_rate, it stays at 15 minutes after a restart.
Is my config in the correct format? It looks like update_rate is nested under server.

dadoonet · March 11, 2020, 2:12pm

I'm on it

dadoonet · March 11, 2020, 2:13pm

update_rate is just the pause duration between two internal runs once FSCrawler has started.

LaaKii · March 11, 2020, 2:14pm

Yes, that's what I thought too.
After setting it to 1m, should it start every 1m?

15:12:15,171 INFO  [f.p.e.c.f.FsParserAbstract] FS crawler started for [dev_binary] for [/D:/TestData] every [15m]

dadoonet · March 11, 2020, 2:16pm

It will start a new run 1 minute after the current run has finished.
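
To make the timing concrete, here is a minimal sketch of that "pause after each run" behavior. It is not FSCrawler's code: the class name, the method name, and the run durations are all hypothetical, and it only computes when each run would begin if the pause is counted from the end of the previous run.

```java
import java.util.ArrayList;
import java.util.List;

public class FixedDelayDemo {
    /**
     * Computes the start time of each run when a run begins only after the
     * previous one has finished plus a fixed pause (the role update_rate
     * plays in the explanation above). All values are in seconds.
     */
    public static List<Long> startTimes(long[] runDurations, long pause) {
        List<Long> starts = new ArrayList<>();
        long clock = 0;
        for (long duration : runDurations) {
            starts.add(clock);
            // The next run waits for this run to finish, then for the pause.
            clock += duration + pause;
        }
        return starts;
    }

    public static void main(String[] args) {
        // Three hypothetical runs lasting 30s, 90s and 10s with a 60s pause.
        System.out.println(startTimes(new long[] {30, 90, 10}, 60)); // [0, 90, 240]
    }
}
```

With a 60-second pause the runs start at 0s, 90s and 240s rather than at 0s, 60s and 120s, so the interval between starts depends on how long each crawl takes.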

LaaKii · March 11, 2020, 2:17pm

Aaaaah ok. Is there a way to set a schedule for when FSCrawler should look for new or edited files?
I thought that was the way to go.

dadoonet · March 11, 2020, 2:28pm

You can put FSCrawler in a crontab I guess and run it with the option --loop 1. It will exit after one run.

See https://fscrawler.readthedocs.io/en/latest/admin/cli-options.html#loop

LaaKii · March 12, 2020, 10:02am

Hey @dadoonet, I'm experiencing another issue with the url.

If the url contains a subfolder, an exception is raised.

I got the following structure:

TestData
| - TestDataSubFolder

url: "/D:/TestData"

Exception:

11:00:40,889 WARN  [f.p.e.c.f.FsParserAbstract] Error while crawling /D:/TestData: String index out of range: -1
11:00:40,889 WARN  [f.p.e.c.f.FsParserAbstract] Full stacktrace
java.lang.StringIndexOutOfBoundsException: String index out of range: -1
        at java.lang.String.substring(Unknown Source) ~[?:1.8.0_202]
        at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.indexDirectory(FsParserAbstract.java:541) ~[fscrawler-core-2.7-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:289) ~[fscrawler-core-2.7-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.run(FsParserAbstract.java:149) [fscrawler-core-2.7-SNAPSHOT.jar:?]
        at java.lang.Thread.run(Unknown Source) [?:1.8.0_202]
11:00:40,905 DEBUG [f.p.e.c.f.FsParserAbstract] Fs crawler is going to sleep for 15m

It works if there isn't a subfolder. Quite strange.

Edit:
Somehow the same exception is now also being raised when there isn't a subfolder.

It doesn't work anymore...

I'm quite confused. After looking at the code and reproducing this line in particular:

String rootdir = path.substring(0, path.lastIndexOf(File.separator));

I always get -1 from path.lastIndexOf, which means the separator couldn't be found.
So I wrote my string like this: \D:\TestData, and lastIndexOf returns the correct index.

Now I'm confused about how this worked before, or whether I changed something that is now causing errors...
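
The behavior described above can be reproduced in isolation. The sketch below is not the FSCrawler source: it copies only the quoted substring line into a hypothetical helper, and passes the separator as a parameter (instead of reading File.separator) so the Windows case, where File.separator is a backslash, can be checked on any platform. The rootDirSafe variant is likewise a hypothetical guard, not a fix taken from the project.

```java
public class SeparatorDemo {
    // Same expression as the quoted line, with the separator made explicit.
    public static String rootDir(String path, String separator) {
        return path.substring(0, path.lastIndexOf(separator));
    }

    // Hypothetical defensive variant: returns the path unchanged when the
    // separator does not occur, instead of calling substring(0, -1).
    public static String rootDirSafe(String path, String separator) {
        int idx = path.lastIndexOf(separator);
        return idx < 0 ? path : path.substring(0, idx);
    }

    public static void main(String[] args) {
        String windowsSeparator = "\\"; // value of File.separator on Windows

        // No backslash in a forward-slash path, so lastIndexOf returns -1.
        System.out.println("/D:/TestData".lastIndexOf(windowsSeparator)); // -1

        // With backslashes the separator is found and substring succeeds.
        System.out.println(rootDir("\\D:\\TestData", windowsSeparator)); // \D:

        try {
            rootDir("/D:/TestData", windowsSeparator);
        } catch (StringIndexOutOfBoundsException e) {
            // substring(0, -1) is what raises the exception seen in the log.
            System.out.println("Caught StringIndexOutOfBoundsException");
        }
    }
}
```

This only shows why a forward-slash url combined with a backslash separator fails on that line; it does not explain why the same configuration worked earlier.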

dadoonet · March 12, 2020, 10:47am

Is that working now? If not, please open an issue.

LaaKii · March 12, 2020, 10:51am

I copied an old version of _status.json into the job directory, and it now works if there are no subfolders.

The same exception gets raised if there are subfolders.
Any idea about that?

I'll open an issue.

EDIT: It again seems to be a problem with the url. I just tested it locally on my Windows machine and it all worked fine with subdirectories.

