Ingesting documents (PDF, Word, .txt) into Elasticsearch

Can I do that automatically, without typing the whole ./fscrawler jobs --restart command every time?

You mean reindex everything at every run?

Not really, not with a single command.

But, you can start fscrawler from a crontab and run something like:

bin/fscrawler job_name --restart --loop 1

It will run FSCrawler only once and then exit. Any time you run this command again, it will restart from scratch for a single run.
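
For example, a crontab entry along these lines would re-run it every 15 minutes. This is a sketch only: the installation path /opt/fscrawler, the log path, and the job name job_name are placeholders for your own setup.

# Hypothetical crontab entry: run FSCrawler once every 15 minutes,
# reindexing from scratch each time, and append output to a log file.
*/15 * * * * /opt/fscrawler/bin/fscrawler job_name --restart --loop 1 >> /var/log/fscrawler.log 2>&1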

Yes, but what I mean is zero administration on re-indexing. The project involves running FSCrawler and just leaving it there while employees dump files and records into all the folders.

It's like every 15 minutes I run the whole ./fscrawler jobs --restart command, like a Task Scheduler .bat file in Windows?

BTW, I'm on a CentOS machine.

@dadoonet
Also, is FSCrawler set to run only once because of the timestamp identification?

Exactly. But with --loop 1.

With this option, FSCrawler will run only once then exit.

I think I managed to build a solution by adding it to the crontab itself:

"* * * * * bash /fscrawler.sh jobs --restart"

With this entry, cron runs the script every minute, automating the whole thing without the timestamp needing to be adjusted.

Anyway, what is the behavior of FSCrawler with --loop 1 when someone deletes a file in the directory?

This is wrong. You will end up with a lot of processes running in parallel.

You should use:

* * * * * bash /fscrawler.sh jobs --restart --loop 1
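
If a single run can take longer than the cron interval, even --loop 1 runs can pile up. A hedged sketch: wrapping the command in flock(1) from util-linux makes cron skip a run while the previous one still holds the lock. The lock file path /tmp/fscrawler.lock here is hypothetical.

# Sketch: -n makes flock exit immediately instead of waiting if the
# previous run still holds the lock, so runs never overlap.
* * * * * flock -n /tmp/fscrawler.lock bash /fscrawler.sh jobs --restart --loop 1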

Well, if you don't use the --restart option, then any time you launch FSCrawler again, it should detect files which have been removed in the meantime.

If you use --restart, I think it will not detect file removals. But that's a guess, as I have never tested that.

--loop has no effect on detection. It's only there to exit after a given number of runs.

@dadoonet

This is noted, thank you so much! I will bring updates today.
