Ingesting documents (pdf, word, .txt) to elasticsearch

Mark_Dendrix_Garcia · February 17, 2017, 9:18am

Can I do that automatically, without typing the whole ./fscrawler jobs --restart command?

dadoonet · February 17, 2017, 9:47am

You mean reindex everything at every run?

Not really with one single command.

But you can start fscrawler from a crontab and run something like:

bin/fscrawler job_name --restart --loop 1

It will run FSCrawler only once, then exit. Any time you run this command again, it will start from scratch for a single run.

Mark_Dendrix_Garcia · February 17, 2017, 10:15am

Yes, but what I mean is zero administration for re-indexing. The project involves starting fscrawler and just leaving it running, while employees dump all their files and records into the folders.

Mark_Dendrix_Garcia · February 17, 2017, 10:16am

Something like running the whole ./fscrawler jobs --restart command every 15 minutes, like a Task Scheduler .bat file on Windows?

Mark_Dendrix_Garcia · February 17, 2017, 10:16am

BTW I'm on a CentOS machine.

Mark_Dendrix_Garcia · February 17, 2017, 10:31am

@dadoonet
Also, so FSCrawler is set to run only once, because of the timestamp identification?

dadoonet · February 17, 2017, 11:01am

Exactly. But with --loop 1.

With this option, FSCrawler will run only once, then exit.

Mark_Dendrix_Garcia · February 17, 2017, 11:04am

I think I managed to build a solution by adding it to the crontab itself:

"* * * * * bash /fscrawler.sh jobs --restart"

With this entry, crontab runs the script every minute, which automates the re-indexing without needing to adjust the timestamp.

Mark_Dendrix_Garcia · February 17, 2017, 11:06am

Anyway, what is the behavior of FSCrawler with --loop 1 when someone deletes a file in the directory?

dadoonet · February 17, 2017, 11:21am

This is wrong. You will end up with a lot of processes running in parallel.

You should use:

* * * * * bash /fscrawler.sh jobs --restart --loop 1

Well. If you don't use the --restart option, any time you launch fscrawler again, it should detect files which have been removed in the meantime.

If you use --restart, I think it will not detect file removals. But that's a guess, as I never tested that.

--loop has no effect on detection. It's only there to exit after a given number of runs.
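
Note that cron itself does not wait for a previous run to finish, so even with --loop 1, runs can pile up if a single crawl takes longer than the cron interval. A minimal sketch of a guard, assuming bash and a writable /tmp: the function name run_exclusive and the lock path are hypothetical and not part of FSCrawler. It relies only on mkdir being atomic.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: run a command only if no earlier run still holds the lock.
# run_exclusive and the lock directory are illustrative names, not FSCrawler features.

run_exclusive() {
  local lock_dir="$1"
  local status
  shift
  # mkdir is atomic: it fails if the directory already exists,
  # which here means another run is still in progress.
  if ! mkdir "$lock_dir" 2>/dev/null; then
    echo "previous run still active, skipping" >&2
    return 0
  fi
  # Run the command, then release the lock whatever its exit status was.
  "$@"
  status=$?
  rmdir "$lock_dir"
  return "$status"
}

# Example call from the script that cron starts (path and job name are assumptions):
# run_exclusive /tmp/fscrawler_jobs.lock bin/fscrawler jobs --restart --loop 1
```

If the process is killed before it reaches rmdir, the lock directory stays behind and has to be removed by hand, so treat this as a starting point rather than a complete solution.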

Mark_Dendrix_Garcia · February 21, 2017, 1:09am

@dadoonet

Noted, thank you so much! I will bring updates today.

system · March 21, 2017, 1:10am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
