Ingesting documents (PDF, Word, .txt) into Elasticsearch

Hi, I want to ask how this can be done. Here's the scenario: my group and I had a use case for logs, for which we use Logstash. The goal of the "logs use case" is to ingest and analyse all logs. Every machine dumps its logs into a single data center running Hadoop (the MapR distribution, using MapR-FS), while Logstash continuously reads this input and sends it to Elasticsearch.
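For reference, that pipeline is essentially a Logstash file input pointed at the NFS-mounted MapR-FS path plus an elasticsearch output; a rough sketch with placeholder paths and index name:

```
# Sketch only: paths, host and index name are placeholders
input {
  file {
    path => "/mapr/my.cluster.com/logs/*.log"   # NFS-mounted MapR-FS directory
    start_position => "beginning"
  }
}
output {
  elasticsearch {
    hosts => ["127.0.0.1:9200"]
    index => "logs-%{+YYYY.MM.dd}"
  }
}
```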

Now, for the next use case, I know Logstash is not a viable candidate. I need to ingest documents so that Elasticsearch can search within them and eventually return each document's address (specifically, a hyperlink to where the document resides, like "/maprfs/documents/thisdocument.pdf").

Clients continuously dump new documents (PDF, Word, text, or whatever), Elasticsearch continuously ingests them, and when a client searches for a word, Elasticsearch returns the documents containing that word, along with a hyperlink to where each document resides.
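Conceptually, a search would look something like this (a hypothetical example; the index and field names depend entirely on whatever tool does the ingestion):

```
# Hypothetical: search extracted text and return each match's filename and path
curl -XGET "http://127.0.0.1:9200/documents/_search" -d '
{
  "query"   : { "match" : { "content" : "invoice" } },
  "_source" : [ "file.filename", "path.real" ]
}'
```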

I'm quite puzzled about what to use. Is this even possible?

Did you look at the FSCrawler project?

It might be what you are looking for.
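For reference, each FSCrawler job is driven by a _settings.json file; a minimal one looks roughly like this (exact fields depend on the FSCrawler version, and the path, host and port here are placeholders):

```
{
  "name" : "job_name",
  "fs" : {
    "url" : "/path/to/documents",
    "update_rate" : "15m"
  },
  "elasticsearch" : {
    "nodes" : [ {
      "host" : "127.0.0.1",
      "port" : 9200
    } ]
  }
}
```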

Is this the one? https://github.com/dadoonet/fscrawler

Yes, it is.

I'm looking at it right now. I created an index, which is the job name, and tried it on a single PDF. I then tried searching it from Kibana and Elasticsearch, but the search on that index gives me an error.

What kind of error?

@dadoonet

It isn't an error, I guess, but when I try to ingest a PDF and run a search query, here is the result.

Here's my _settings.json:

Can you share the exact commands you used and the FSCrawler logs? Also use the --debug option to get even more details.

And please don't share screenshots, but formatted text with:

```
CODE
```

@dadoonet
Hi! I managed to make it work, but now I have a new problem. FSCrawler looks for new documents every 15 minutes, right? I adjusted the interval to 1 minute and started FSCrawler with ./fscrawler job5. After starting it, I waited a few more minutes and added more documents inside the url folder, but unfortunately those new documents are not indexed.

Please run it with the --debug option and share your logs and config file.
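For example, using the same launch command you showed:

```
./fscrawler job5 --debug
```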

Where can I see those logs?

Or should I just copy and paste the logs here?

@dadoonet First log, without adding any new documents:

```
16:07:47,322 DEBUG [f.p.e.c.f.FsCrawlerImpl] Fs crawler is now waking up again...
16:07:47,323 DEBUG [f.p.e.c.f.FsCrawlerImpl] Fs crawler thread [jobs] is now running. Run #2...
16:07:47,336 DEBUG [f.p.e.c.f.FsCrawlerImpl] indexing [/mapr/my.cluster.com/vm1] content
16:07:47,336 DEBUG [f.p.e.c.f.f.FileAbstractor] Listing local files from /mapr/my.cluster.com/vm1
16:07:47,338 DEBUG [f.p.e.c.f.f.FileAbstractor] 1 local files found
16:07:47,338 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] filename = [changelog.txt], includes = [null], excludes = [[~*]]
16:07:47,339 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] filename = [changelog.txt], excludes = [[~*]]
16:07:47,339 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] filename = [changelog.txt], includes = [null]
16:07:47,339 DEBUG [f.p.e.c.f.FsCrawlerImpl] [changelog.txt] can be indexed: [true]
16:07:47,339 DEBUG [f.p.e.c.f.FsCrawlerImpl]   - file: changelog.txt
16:07:47,339 DEBUG [f.p.e.c.f.FsCrawlerImpl]     - not modified: creation date 2016-12-12T13:48:24 , file date 2016-12-12T13:48:24, last scan date 2017-02-17T16:06:44.904
16:07:47,340 DEBUG [f.p.e.c.f.FsCrawlerImpl] Looking for removed files in [/mapr/my.cluster.com/vm1]...
16:07:47,340 DEBUG [f.p.e.c.f.c.ElasticsearchClient] search [jobs]/[doc], request [SearchRequest{query=path.encoded:66e1f91ce6a0761b736dbb8117e542e, fields=[_source, file.filename], size=10000}]
16:07:47,348 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] filename = [changelog.txt], includes = [null], excludes = [[~*]]
16:07:47,348 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] filename = [changelog.txt], excludes = [[~*]]
16:07:47,349 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] filename = [changelog.txt], includes = [null]
16:07:47,349 DEBUG [f.p.e.c.f.FsCrawlerImpl] Looking for removed directories in [/mapr/my.cluster.com/vm1]...
16:07:47,349 DEBUG [f.p.e.c.f.c.ElasticsearchClient] search [jobs]/[folder], request [SearchRequest{query=encoded:66e1f91ce6a0761b736dbb8117e542e, fields=[], size=10000}]
16:07:47,352 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] filename = [/mapr/my.cluster.com/vm1], includes = [null], excludes = [[~*]]
16:07:47,353 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] filename = [/mapr/my.cluster.com/vm1], excludes = [[~*]]
16:07:47,353 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] filename = [/mapr/my.cluster.com/vm1], includes = [null]
16:07:47,353 DEBUG [f.p.e.c.f.FsCrawlerImpl] Delete folder /mapr/my.cluster.com/vm1//mapr/my.cluster.com/vm1
16:07:47,353 DEBUG [f.p.e.c.f.c.ElasticsearchClient] search [jobs]/[doc], request [SearchRequest{query=path.encoded:eec8ebf874bb5f54d44cace29601ed0, fields=[_source, file.filename], size=10000}]
16:07:47,357 DEBUG [f.p.e.c.f.c.ElasticsearchClient] search [jobs]/[folder], request [SearchRequest{query=encoded:eec8ebf874bb5f54d44cace29601ed0, fields=[], size=10000}]
16:07:47,360 DEBUG [f.p.e.c.f.FsCrawlerImpl] Deleting from ES jobs, folder, eec8ebf874bb5f54d44cace29601ed0
16:07:47,361 DEBUG [f.p.e.c.f.c.BulkProcessor] {"delete":{"_index":"jobs","_type":"folder","_id":"eec8ebf874bb5f54d44cace29601ed0"}}
16:07:47,362 DEBUG [f.p.e.c.f.FsCrawlerImpl] Fs crawler is going to sleep for 1m
16:07:51,907 DEBUG [f.p.e.c.f.c.BulkProcessor] Going to execute new bulk composed of 1 actions
16:07:51,913 DEBUG [f.p.e.c.f.c.ElasticsearchClient] bulk response: BulkResponse{items=[BulkItemTopLevelResponse{index=null, delete=BulkItemResponse{failed=false, index='jobs', type='folder', id='eec8ebf874bb5f54d44cace29601ed0', opType=null, failureMessage='null'}}]}
16:07:51,913 DEBUG [f.p.e.c.f.c.BulkProcessor] Executed bulk composed of 1 actions
```

@dadoonet Second log, after adding a new document (bago5.pdf):

```
16:09:47,394 DEBUG [f.p.e.c.f.FsCrawlerImpl] Fs crawler is now waking up again...
16:09:47,396 DEBUG [f.p.e.c.f.FsCrawlerImpl] Fs crawler thread [jobs] is now running. Run #4...
16:09:47,401 DEBUG [f.p.e.c.f.FsCrawlerImpl] indexing [/mapr/my.cluster.com/vm1] content
16:09:47,402 DEBUG [f.p.e.c.f.f.FileAbstractor] Listing local files from /mapr/my.cluster.com/vm1
16:09:47,404 DEBUG [f.p.e.c.f.f.FileAbstractor] 2 local files found
16:09:47,405 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] filename = [changelog.txt], includes = [null], excludes = [[~*]]
16:09:47,405 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] filename = [changelog.txt], excludes = [[~*]]
16:09:47,405 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] filename = [changelog.txt], includes = [null]
16:09:47,405 DEBUG [f.p.e.c.f.FsCrawlerImpl] [changelog.txt] can be indexed: [true]
16:09:47,405 DEBUG [f.p.e.c.f.FsCrawlerImpl]   - file: changelog.txt
16:09:47,406 DEBUG [f.p.e.c.f.FsCrawlerImpl]     - not modified: creation date 2016-12-12T13:48:24 , file date 2016-12-12T13:48:24, last scan date 2017-02-17T16:08:45.365
16:09:47,406 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] filename = [bago5.pdf], includes = [null], excludes = [[~*]]
16:09:47,406 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] filename = [bago5.pdf], excludes = [[~*]]
16:09:47,406 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] filename = [bago5.pdf], includes = [null]
16:09:47,406 DEBUG [f.p.e.c.f.FsCrawlerImpl] [bago5.pdf] can be indexed: [true]
16:09:47,406 DEBUG [f.p.e.c.f.FsCrawlerImpl]   - file: bago5.pdf
16:09:47,406 DEBUG [f.p.e.c.f.FsCrawlerImpl]     - not modified: creation date 2017-02-16T11:08:43 , file date 2017-02-16T11:08:43, last scan date 2017-02-17T16:08:45.365
16:09:47,406 DEBUG [f.p.e.c.f.FsCrawlerImpl] Looking for removed files in [/mapr/my.cluster.com/vm1]...
16:09:47,407 DEBUG [f.p.e.c.f.c.ElasticsearchClient] search [jobs]/[doc], request [SearchRequest{query=path.encoded:66e1f91ce6a0761b736dbb8117e542e, fields=[_source, file.filename], size=10000}]
16:09:47,414 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] filename = [changelog.txt], includes = [null], excludes = [[~*]]
16:09:47,414 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] filename = [changelog.txt], excludes = [[~*]]
16:09:47,415 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] filename = [changelog.txt], includes = [null]
16:09:47,415 DEBUG [f.p.e.c.f.FsCrawlerImpl] Looking for removed directories in [/mapr/my.cluster.com/vm1]...
16:09:47,415 DEBUG [f.p.e.c.f.c.ElasticsearchClient] search [jobs]/[folder], request [SearchRequest{query=encoded:66e1f91ce6a0761b736dbb8117e542e, fields=[], size=10000}]
16:09:47,420 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] filename = [/mapr/my.cluster.com/vm1], includes = [null], excludes = [[~*]]
16:09:47,420 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] filename = [/mapr/my.cluster.com/vm1], excludes = [[~*]]
16:09:47,420 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] filename = [/mapr/my.cluster.com/vm1], includes = [null]
16:09:47,420 DEBUG [f.p.e.c.f.FsCrawlerImpl] Delete folder /mapr/my.cluster.com/vm1//mapr/my.cluster.com/vm1
16:09:47,421 DEBUG [f.p.e.c.f.c.ElasticsearchClient] search [jobs]/[doc], request [SearchRequest{query=path.encoded:eec8ebf874bb5f54d44cace29601ed0, fields=[_source, file.filename], size=10000}]
16:09:47,424 DEBUG [f.p.e.c.f.c.ElasticsearchClient] search [jobs]/[folder], request [SearchRequest{query=encoded:eec8ebf874bb5f54d44cace29601ed0, fields=[], size=10000}]
16:09:47,427 DEBUG [f.p.e.c.f.FsCrawlerImpl] Deleting from ES jobs, folder, eec8ebf874bb5f54d44cace29601ed0
16:09:47,427 DEBUG [f.p.e.c.f.c.BulkProcessor] {"delete":{"_index":"jobs","_type":"folder","_id":"eec8ebf874bb5f54d44cace29601ed0"}}
16:09:47,428 DEBUG [f.p.e.c.f.FsCrawlerImpl] Fs crawler is going to sleep for 1m
16:09:51,931 DEBUG [f.p.e.c.f.c.BulkProcessor] Going to execute new bulk composed of 1 actions
16:09:51,937 DEBUG [f.p.e.c.f.c.ElasticsearchClient] bulk response: BulkResponse{items=[BulkItemTopLevelResponse{index=null, delete=BulkItemResponse{failed=false, index='jobs', type='folder', id='eec8ebf874bb5f54d44cace29601ed0', opType=null, failureMessage='null'}}]}
16:09:51,938 DEBUG [f.p.e.c.f.c.BulkProcessor] Executed bulk composed of 1 actions
```

@dadoonet I notice that in the second log, changelog.txt repeats after `[changelog.txt] can be indexed: [true]`, while bago5.pdf doesn't.

*sigh* I don't know what's going on. I tried repeating every step of the process.

Please format your code using the </> icon as explained in this guide. It will make your post more readable.

Or use markdown style like:

```
CODE
```

I updated your answers.

The problem here is:

```
creation date:  2017-02-16T11:08:43
file date    :  2017-02-16T11:08:43
last scan date: 2017-02-17T16:08:45.365
```

FSCrawler only picks up a file whose modification date is newer than the last scan date, and here the file date (2017-02-16) is older than the last scan (2017-02-17), so bago5.pdf is treated as "not modified".

So your FS does not change the file modification date when you move a file into the folder.
The only way I believe to fix it is to run:

```
touch /mapr/my.cluster.com/vm1/bago5.pdf
```

So it will get a more recent date.
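To check this on your side, compare the timestamps before and after dropping a file in (GNU stat syntax shown here; BSD stat differs):

```
# Show the modification time of the new file and of its parent directory
stat -c '%y  %n' /mapr/my.cluster.com/vm1/bago5.pdf
stat -c '%y  %n' /mapr/my.cluster.com/vm1
```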

But these directories are NFS-mounted, and users only drag and drop the files into them. So the only way is to touch them?

Not sure if I can do anything on the FSCrawler side.

Maybe I need to detect what kind of implementation the underlying FS is and find out whether anything in the Java API can help.

But if I don't have any information about the fact that a file has been added, I can't detect it.

Maybe the parent directory changed? Then I could detect that? But I'm really unsure.

If you have a way to detect changes easily yourself, for example with a shell script, then you could think of using FSCrawler as a gateway to Elasticsearch and activate the REST endpoint.

See https://github.com/dadoonet/fscrawler#rest-service

Then, from within your script, you can send a file with:

```
curl -F "file=@/path/to/yourfile.txt" "http://127.0.0.1:8080/fscrawler/_upload"
```
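A rough sketch of such a script, assuming FSCrawler was started with the REST service enabled as described in the link above (the watched path and state file are placeholders; it tracks filenames it has already uploaded rather than mtimes, since mtimes are exactly what your FS does not update):

```
#!/bin/sh
# Poll an NFS directory and upload files FSCrawler hasn't seen yet.
WATCH_DIR=/mapr/my.cluster.com/vm1            # placeholder path
ENDPOINT=http://127.0.0.1:8080/fscrawler/_upload
SEEN=/var/tmp/fscrawler-seen.txt              # list of already-uploaded files

touch "$SEEN"
for f in "$WATCH_DIR"/*; do
  [ -f "$f" ] || continue                     # skip directories
  if ! grep -qxF "$f" "$SEEN"; then
    curl -F "file=@$f" "$ENDPOINT" && echo "$f" >> "$SEEN"
  fi
done
```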

I hope this helps.

Can you open an issue in FSCrawler with all the details (it sounds like you are using MapR) and a scenario to reproduce it? I'll try to play with MapR if time allows.

@dadoonet

I see, it's a bit clearer now. So FSCrawler's first run makes sure all files are indexed, because it doesn't look at timestamps, whereas subsequent runs check each file's timestamp to determine which files are new and index those?

Is that the correct understanding?

Exactly. You can always restart from scratch by using the --restart option, which removes the status file and reindexes everything.
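For example, with the same job name as before:

```
./fscrawler job5 --restart
```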