Ingesting documents (PDF, Word, .txt) into Elasticsearch


(David Pilato) #12

Please run it with the debug option and share your logs and config file.


(Mark Dendrix Garcia) #13

Where can I see those logs?

Or should I just copy-paste the logs here?


(Mark Dendrix Garcia) #14

@dadoonet First log, without adding any new documents:

```
16:07:47,322 DEBUG [f.p.e.c.f.FsCrawlerImpl] Fs crawler is now waking up again...
16:07:47,323 DEBUG [f.p.e.c.f.FsCrawlerImpl] Fs crawler thread [jobs] is now running. Run #2...
16:07:47,336 DEBUG [f.p.e.c.f.FsCrawlerImpl] indexing [/mapr/my.cluster.com/vm1] content
16:07:47,336 DEBUG [f.p.e.c.f.f.FileAbstractor] Listing local files from /mapr/my.cluster.com/vm1
16:07:47,338 DEBUG [f.p.e.c.f.f.FileAbstractor] 1 local files found
16:07:47,338 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] filename = [changelog.txt], includes = [null], excludes = [[~*]]
16:07:47,339 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] filename = [changelog.txt], excludes = [[~*]]
16:07:47,339 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] filename = [changelog.txt], includes = [null]
16:07:47,339 DEBUG [f.p.e.c.f.FsCrawlerImpl] [changelog.txt] can be indexed: [true]
16:07:47,339 DEBUG [f.p.e.c.f.FsCrawlerImpl]   - file: changelog.txt
16:07:47,339 DEBUG [f.p.e.c.f.FsCrawlerImpl]     - not modified: creation date 2016-12-12T13:48:24 , file date 2016-12-12T13:48:24, last scan date 2017-02-17T16:06:44.904
16:07:47,340 DEBUG [f.p.e.c.f.FsCrawlerImpl] Looking for removed files in [/mapr/my.cluster.com/vm1]...
16:07:47,340 DEBUG [f.p.e.c.f.c.ElasticsearchClient] search [jobs]/[doc], request [SearchRequest{query=path.encoded:66e1f91ce6a0761b736dbb8117e542e, fields=[_source, file.filename], size=10000}]
16:07:47,348 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] filename = [changelog.txt], includes = [null], excludes = [[~*]]
16:07:47,348 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] filename = [changelog.txt], excludes = [[~*]]
16:07:47,349 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] filename = [changelog.txt], includes = [null]
16:07:47,349 DEBUG [f.p.e.c.f.FsCrawlerImpl] Looking for removed directories in [/mapr/my.cluster.com/vm1]...
16:07:47,349 DEBUG [f.p.e.c.f.c.ElasticsearchClient] search [jobs]/[folder], request [SearchRequest{query=encoded:66e1f91ce6a0761b736dbb8117e542e, fields=[], size=10000}]
16:07:47,352 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] filename = [/mapr/my.cluster.com/vm1], includes = [null], excludes = [[~*]]
16:07:47,353 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] filename = [/mapr/my.cluster.com/vm1], excludes = [[~*]]
16:07:47,353 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] filename = [/mapr/my.cluster.com/vm1], includes = [null]
16:07:47,353 DEBUG [f.p.e.c.f.FsCrawlerImpl] Delete folder /mapr/my.cluster.com/vm1//mapr/my.cluster.com/vm1
16:07:47,353 DEBUG [f.p.e.c.f.c.ElasticsearchClient] search [jobs]/[doc], request [SearchRequest{query=path.encoded:eec8ebf874bb5f54d44cace29601ed0, fields=[_source, file.filename], size=10000}]
16:07:47,357 DEBUG [f.p.e.c.f.c.ElasticsearchClient] search [jobs]/[folder], request [SearchRequest{query=encoded:eec8ebf874bb5f54d44cace29601ed0, fields=[], size=10000}]
16:07:47,360 DEBUG [f.p.e.c.f.FsCrawlerImpl] Deleting from ES jobs, folder, eec8ebf874bb5f54d44cace29601ed0
16:07:47,361 DEBUG [f.p.e.c.f.c.BulkProcessor] {"delete":{"_index":"jobs","_type":"folder","_id":"eec8ebf874bb5f54d44cace29601ed0"}}
16:07:47,362 DEBUG [f.p.e.c.f.FsCrawlerImpl] Fs crawler is going to sleep for 1m
16:07:51,907 DEBUG [f.p.e.c.f.c.BulkProcessor] Going to execute new bulk composed of 1 actions
16:07:51,913 DEBUG [f.p.e.c.f.c.ElasticsearchClient] bulk response: BulkResponse{items=[BulkItemTopLevelResponse{index=null, delete=BulkItemResponse{failed=false, index='jobs', type='folder', id='eec8ebf874bb5f54d44cace29601ed0', opType=null, failureMessage='null'}}]}
16:07:51,913 DEBUG [f.p.e.c.f.c.BulkProcessor] Executed bulk composed of 1 actions
```

(Mark Dendrix Garcia) #15

@dadoonet Second log, after adding a new document (bago5.pdf):

```
16:09:47,394 DEBUG [f.p.e.c.f.FsCrawlerImpl] Fs crawler is now waking up again...
16:09:47,396 DEBUG [f.p.e.c.f.FsCrawlerImpl] Fs crawler thread [jobs] is now running. Run #4...
16:09:47,401 DEBUG [f.p.e.c.f.FsCrawlerImpl] indexing [/mapr/my.cluster.com/vm1] content
16:09:47,402 DEBUG [f.p.e.c.f.f.FileAbstractor] Listing local files from /mapr/my.cluster.com/vm1
16:09:47,404 DEBUG [f.p.e.c.f.f.FileAbstractor] 2 local files found
16:09:47,405 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] filename = [changelog.txt], includes = [null], excludes = [[~*]]
16:09:47,405 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] filename = [changelog.txt], excludes = [[~*]]
16:09:47,405 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] filename = [changelog.txt], includes = [null]
16:09:47,405 DEBUG [f.p.e.c.f.FsCrawlerImpl] [changelog.txt] can be indexed: [true]
16:09:47,405 DEBUG [f.p.e.c.f.FsCrawlerImpl]   - file: changelog.txt
16:09:47,406 DEBUG [f.p.e.c.f.FsCrawlerImpl]     - not modified: creation date 2016-12-12T13:48:24 , file date 2016-12-12T13:48:24, last scan date 2017-02-17T16:08:45.365
16:09:47,406 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] filename = [bago5.pdf], includes = [null], excludes = [[~*]]
16:09:47,406 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] filename = [bago5.pdf], excludes = [[~*]]
16:09:47,406 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] filename = [bago5.pdf], includes = [null]
16:09:47,406 DEBUG [f.p.e.c.f.FsCrawlerImpl] [bago5.pdf] can be indexed: [true]
16:09:47,406 DEBUG [f.p.e.c.f.FsCrawlerImpl]   - file: bago5.pdf
16:09:47,406 DEBUG [f.p.e.c.f.FsCrawlerImpl]     - not modified: creation date 2017-02-16T11:08:43 , file date 2017-02-16T11:08:43, last scan date 2017-02-17T16:08:45.365
16:09:47,406 DEBUG [f.p.e.c.f.FsCrawlerImpl] Looking for removed files in [/mapr/my.cluster.com/vm1]...
16:09:47,407 DEBUG [f.p.e.c.f.c.ElasticsearchClient] search [jobs]/[doc], request [SearchRequest{query=path.encoded:66e1f91ce6a0761b736dbb8117e542e, fields=[_source, file.filename], size=10000}]
16:09:47,414 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] filename = [changelog.txt], includes = [null], excludes = [[~*]]
16:09:47,414 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] filename = [changelog.txt], excludes = [[~*]]
16:09:47,415 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] filename = [changelog.txt], includes = [null]
16:09:47,415 DEBUG [f.p.e.c.f.FsCrawlerImpl] Looking for removed directories in [/mapr/my.cluster.com/vm1]...
16:09:47,415 DEBUG [f.p.e.c.f.c.ElasticsearchClient] search [jobs]/[folder], request [SearchRequest{query=encoded:66e1f91ce6a0761b736dbb8117e542e, fields=[], size=10000}]
16:09:47,420 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] filename = [/mapr/my.cluster.com/vm1], includes = [null], excludes = [[~*]]
16:09:47,420 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] filename = [/mapr/my.cluster.com/vm1], excludes = [[~*]]
16:09:47,420 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] filename = [/mapr/my.cluster.com/vm1], includes = [null]
16:09:47,420 DEBUG [f.p.e.c.f.FsCrawlerImpl] Delete folder /mapr/my.cluster.com/vm1//mapr/my.cluster.com/vm1
16:09:47,421 DEBUG [f.p.e.c.f.c.ElasticsearchClient] search [jobs]/[doc], request [SearchRequest{query=path.encoded:eec8ebf874bb5f54d44cace29601ed0, fields=[_source, file.filename], size=10000}]
16:09:47,424 DEBUG [f.p.e.c.f.c.ElasticsearchClient] search [jobs]/[folder], request [SearchRequest{query=encoded:eec8ebf874bb5f54d44cace29601ed0, fields=[], size=10000}]
16:09:47,427 DEBUG [f.p.e.c.f.FsCrawlerImpl] Deleting from ES jobs, folder, eec8ebf874bb5f54d44cace29601ed0
16:09:47,427 DEBUG [f.p.e.c.f.c.BulkProcessor] {"delete":{"_index":"jobs","_type":"folder","_id":"eec8ebf874bb5f54d44cace29601ed0"}}
16:09:47,428 DEBUG [f.p.e.c.f.FsCrawlerImpl] Fs crawler is going to sleep for 1m
16:09:51,931 DEBUG [f.p.e.c.f.c.BulkProcessor] Going to execute new bulk composed of 1 actions
16:09:51,937 DEBUG [f.p.e.c.f.c.ElasticsearchClient] bulk response: BulkResponse{items=[BulkItemTopLevelResponse{index=null, delete=BulkItemResponse{failed=false, index='jobs', type='folder', id='eec8ebf874bb5f54d44cace29601ed0', opType=null, failureMessage='null'}}]}
16:09:51,938 DEBUG [f.p.e.c.f.c.BulkProcessor] Executed bulk composed of 1 actions
```

(Mark Dendrix Garcia) #16

@dadoonet I noticed that in the second log, changelog.txt shows up again after the [changelog.txt] can be indexed: [true] line, while bago5.pdf doesn't.

*sigh* I don't know what's going on. I've tried repeating the whole process.


(David Pilato) #17

Please format your code using the </> icon, as explained in this guide. It will make your post more readable.

Or use Markdown style, like:

```
CODE
```

I updated your answers.

The problem here is:

```
creation date:  2017-02-16T11:08:43
file date    :  2017-02-16T11:08:43
last scan date: 2017-02-17T16:08:45.365
```

So your filesystem does not change the file's modification date when you move it into the folder.
The only fix I can think of is to run:

```
touch /mapr/my.cluster.com/vm1/bago5.pdf
```

So it will get a more recent date.
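For example, a minimal sketch of checking and then refreshing the timestamp, assuming GNU coreutils on the NFS client:

```
# Show the current modification time, i.e. the value FSCrawler compares
# against its last scan date
stat -c '%y %n' /mapr/my.cluster.com/vm1/bago5.pdf

# Reset the modification time to "now" so the next run picks the file up
touch /mapr/my.cluster.com/vm1/bago5.pdf
```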


(Mark Dendrix Garcia) #18

But these directories are NFS-mounted, and users just drag and drop their files there. So the only way is to touch them?


(David Pilato) #19

Not sure there is anything I can do on the FSCrawler side.

Maybe I need to detect what kind of filesystem implementation is underneath and see whether anything in the Java API can help.

But if the filesystem gives me no indication that a file has been added, I can't detect it.

Maybe the parent directory's date changes, so I could detect that? But I'm really not sure.

If you have your own way to detect changes easily, for example with a shell script, you could use FSCrawler as a gateway to Elasticsearch and activate the REST endpoint.

See https://github.com/dadoonet/fscrawler#rest-service

Then your script can send a file with:

```
curl -F "file=@/path/to/yourfile.txt" "http://127.0.0.1:8080/fscrawler/_upload"
```
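As a rough sketch of such a script, assuming GNU find and bash (the marker file path is hypothetical, and whether ctime behaves this way over your NFS copy semantics is worth verifying):

```
#!/bin/bash
# Sketch: push files added since the last run to FSCrawler's REST endpoint.
MARKER=/var/run/fscrawler.marker   # hypothetical marker file
DIR=/mapr/my.cluster.com/vm1

# On the first run, date the marker to the epoch so everything is uploaded
[ -f "$MARKER" ] || touch -t 197001010000 "$MARKER"

# -cnewer compares the inode change time (ctime), which is set when the
# file lands on the mount even if the copy preserved its mtime
find "$DIR" -type f -cnewer "$MARKER" -print0 |
  while IFS= read -r -d '' f; do
    curl -F "file=@$f" "http://127.0.0.1:8080/fscrawler/_upload"
  done

touch "$MARKER"
```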

I hope this helps.

Can you open an issue in FSCrawler with all details (sounds like you are using MapR) and a scenario to reproduce it? I'll try to play with MapR if time allows.


(Mark Dendrix Garcia) #20

@dadoonet

I see, it's a bit clearer now. So FSCrawler's first run ensures all files are indexed because it doesn't look at timestamps, whereas subsequent runs check each file's timestamp to determine which files are new and index only those?

Is that understanding correct?


(David Pilato) #21

Exactly. You can always restart from scratch by using the --restart option, which removes the status file and reindexes everything.


(Mark Dendrix Garcia) #22

Can I do that automatically, without typing the whole ./fscrawler jobs --restart command every time?


(David Pilato) #23

You mean? Reindex everything at every run?

Not really with one single command.

But you can start FSCrawler from a crontab and run something like:

```
bin/fscrawler job_name --restart --loop 1
```

It will run FSCrawler only once and then exit. Any time you rerun this command, it will start from scratch for a single run.
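For example, a crontab entry along these lines would reindex every 15 minutes (the install path and log file are made up for illustration):

```
# Reindex from scratch every 15 minutes, one run per invocation
*/15 * * * * /opt/fscrawler/bin/fscrawler jobs --restart --loop 1 >> /var/log/fscrawler.log 2>&1
```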


(Mark Dendrix Garcia) #24

Yes, but what I mean is zero administration for re-indexing. The project involves running FSCrawler and just leaving it there while employees dump files and records into the folders.


(Mark Dendrix Garcia) #25

So it's like running the whole ./fscrawler jobs --restart command every 15 minutes, like a Task Scheduler .bat file on Windows?


(Mark Dendrix Garcia) #26

BTW, I'm on a CentOS machine.


(Mark Dendrix Garcia) #27

@dadoonet
Also, is FSCrawler meant to be run only once, because of the timestamp checks?


(David Pilato) #28

Exactly, but only with --loop 1.

With this option, FSCrawler runs only once and then exits.


(Mark Dendrix Garcia) #29

I think I managed to build a solution by adding it to the crontab itself:

"* * * * * bash /fscrawler.sh jobs --restart"

With this entry, cron runs the script every minute, automating re-indexing without the timestamps needing to be adjusted.


(Mark Dendrix Garcia) #30

Anyway, what is the behavior of FSCrawler with --loop 1 when someone deletes a file in the directory?


(David Pilato) #31

This is wrong. You will end up with a lot of processes running in parallel.

You should use:

```
* * * * * bash /fscrawler.sh jobs --restart --loop 1
```
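If a single crawl can still outlast the one-minute interval, a sketch using flock from util-linux would skip a run while the previous one is still going (the lock file path is hypothetical):

```
# -n: exit immediately instead of queueing if the lock is already held
* * * * * flock -n /tmp/fscrawler.lock bash /fscrawler.sh jobs --restart --loop 1
```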

Well, if you don't use the --restart option, any time you launch FSCrawler again it should detect files which have been removed in the meantime.

If you use --restart, I think it will not detect file removals, but that's a guess as I have never tested it.

--loop has no effect on detection; it's only there to make FSCrawler exit after a given number of runs.