FSCrawler Rest with Elasticsearch

Dear All,

I have configured FSCrawler 2.10 on Ubuntu and it works great, but I have a few minor questions:

a. Why is touch required? I understand that the timestamp has to be new on files copied or uploaded to the directory, so FSCrawler only looks for newer ones. But why? Why doesn't it just pick up and crawl any newly added file regardless of its date?

b. REST API: I uploaded a file as multipart form data and got the details below:

{
  "ok": true,
  "filename": "temp2.doc",
  "url": "http://10.0.10.10:9200/es-cms/_doc/b7ba5b225841673daf7554132bca1",
  "doc": {
    "content": "\nTextbausteine:",
    "meta": {
      "date": "2021-06-07T06:49:00.000+00:00",
      "modifier": "Test Autho",
      "created": "2021-06-02T10:56:00.000+00:00"
    },
    "file": {
      "extension": "doc",
      "content_type": "application/msword",
      "indexing_date": "2023-01-24T06:57:56.019+00:00",
      "filename": "temp2.doc"
    },
    "path": {
      "virtual": "temp2.doc",
      "real": "temp2.doc"
    }
  }
}

My concern is: where is this temp2.doc file physically uploaded? I can't find it at the fs: url location, so where does it really get uploaded to?
The Elasticsearch URL from the output above can be used to view the details, but why is this upload not shown in the Discover section of Kibana?

c. Kibana Discover shows only files that were manually copied into the directory. It doesn't detect or show files uploaded through the REST API. Why is that?

Welcome!

That's because of the current implementation. I want to implement a WatchService but it's not there yet.
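Until then, the copy-then-touch step can be scripted. The following is a minimal sketch, assuming (as described in the question) that the crawler only picks up files whose timestamp is newer than its last run. `copy_and_touch` is a hypothetical helper name, not part of FSCrawler.

```python
import os
import shutil


def copy_and_touch(src, dest_dir):
    """Copy src into dest_dir and set its timestamps to the current time.

    Hypothetical helper: roughly equivalent to `cp` followed by `touch`.
    """
    os.makedirs(dest_dir, exist_ok=True)
    dest = os.path.join(dest_dir, os.path.basename(src))
    # copyfile copies the content only, without the source's metadata
    shutil.copyfile(src, dest)
    # Passing None sets both access and modification time to "now"
    os.utime(dest, None)
    return dest
```

The explicit `os.utime` call matters mostly when a tool such as `cp -p` or `shutil.copy2` is used instead, since those preserve the original, older modification time.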

It's not uploaded anywhere, as sadly Elasticsearch does not have an S3-like binary blob store. The idea is that you share the source file somewhere yourself, using an HTTP server.
If you want, you can activate the store_source option:

https://fscrawler.readthedocs.io/en/latest/admin/fs/local-fs.html#storing-binary-source-document

But I don't recommend it unless you are storing very tiny documents (maximum some kilobytes).
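For illustration, here is a minimal sketch of getting the original file back out of a document indexed with store_source enabled. It assumes the binary is kept BASE64-encoded in a field named `attachment`; that field name is an assumption to verify against the linked documentation, and `restore_source` is a hypothetical helper.

```python
import base64


def restore_source(doc, out_path):
    """Write the stored binary of an indexed document back to disk.

    `doc` is the document source as a dict. The "attachment" field name
    is an assumption; check the FSCrawler docs for your version.
    """
    encoded = doc.get("attachment")
    if encoded is None:
        raise KeyError("no 'attachment' field; was store_source enabled?")
    data = base64.b64decode(encoded)
    with open(out_path, "wb") as fh:
        fh.write(data)
    return len(data)  # number of bytes written
```

Base64 encoding makes the stored payload about a third larger than the original file, which is one reason this only suits very small documents.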

My guess is that Kibana uses a date which is not available in case of the REST interface. Could you check if you are using a date field in the Kibana index pattern and which one is it?

Dear Mr. David,

I appreciate your time for the clarifications. I just need your advice further:

"It's not uploaded as sadly Elasticsearch does not have a binary blob store like s3-like. The idea is that you share using an http server the source file somewhere.
If you want, you can activate the store_source But I don't recommend it unless you are storing very tiny documents (maximum some kilobytes)."
My expectation is that a file uploaded through the REST API is also stored on the file system, e.g. in "/home/testweb/fscrawler-2.10/uploads", in addition to being indexed in ES, and that the data in ES reflects this upload location in:
"path": {
"virtual": "temp2.doc",
"real": "temp2.doc"

Regarding Kibana, I am using the time field 'file.created'. Should I be using index_created instead?

That's a nice idea actually. Could you please open a feature request in the project?

I'm wondering if the REST service should actually set the upload date in a field. Could you open a feature request for this as well?

In the meantime, you can just disable the date field in the Kibana index pattern so you don't have to pick one.
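To confirm that the REST-uploaded documents are the ones missing the time field, the index can be asked how many documents lack it. This is a minimal sketch using the Elasticsearch `exists` query and the `_count` endpoint; the helper names are hypothetical, and it assumes the cluster is reachable without authentication.

```python
import json
import urllib.request


def missing_field_query(field):
    """Build a query body matching documents where `field` is absent."""
    return {"query": {"bool": {"must_not": {"exists": {"field": field}}}}}


def count_missing(es_url, index, field):
    """POST the query to the _count endpoint and return the match count."""
    body = json.dumps(missing_field_query(field)).encode("utf-8")
    req = urllib.request.Request(
        f"{es_url}/{index}/_count",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["count"]
```

With the values from this thread, the call would be `count_missing("http://10.0.10.10:9200", "es-cms", "file.created")`. A non-zero result means some documents have no value for the Kibana time field, which would explain why they do not appear in Discover.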

Thanks a lot again, I will do that. I appreciate your time.
