Retrieve metadata of document in response to queries

I am indexing the text content of a file with
res = es.index(index=myIndex, body=jsonDict), where jsonDict is a JSON dictionary that contains only the text of a PDF file.
I am able to query the content and retrieve the relevant text.
My question is: how do I index the file contents so that I can also retrieve file info such as:
"file" : {
"extension" : "pdf",
"content_type" : "application/pdf",
"created" : "2019-10-01T15:28:31.000+0000",
"last_modified" : "2019-10-01T15:28:31.000+0000",
"last_accessed" : "2019-10-01T15:38:50.000+0000",
"indexing_date" : "2019-10-01T15:39:08.055+0000",
"filesize" : 877861,

and
"path" : {
"root" : "4a997482a3826d51751b8e7c01e476c",
"virtual" : "/P_GB27980_20120213.pdf",
"real" : "/Users/madabhuc/Documents/IR/presto/eval/P_GB27980_20120213.pdf"

Do I need to provide this info explicitly when I submit the files to Elasticsearch for indexing?
Thanks.

You need to provide everything you want to index, that is, the full document.

The background is that I am trying to index the individual pages of a PDF document, which FSCrawler does not currently support. So I am trying to interface with Elasticsearch myself.
It is not clear to me how you are providing the file to ES: are you writing the Tika output to a file and then sending it to ES?

Thanks for your help.

From which language? Java?

Python

In jsonDict you should provide the full JSON content, including the metadata and the extracted text.
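
As a rough sketch of what that could look like in Python, the function below collects a few of the "file" fields from the example with the standard library and puts them next to the extracted text. The function name build_document and the "content" field name are assumptions made for illustration, not something Elasticsearch or FSCrawler requires. The "created" field is left out because a portable creation time is not available from os.stat on every platform.

```python
import mimetypes
import os
from datetime import datetime, timezone


def _iso(timestamp):
    # Format a POSIX timestamp as an ISO 8601 string in UTC.
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).isoformat(
        timespec="milliseconds"
    )


def build_document(file_path, content):
    # Hypothetical helper: combine the extracted text with file metadata.
    stat = os.stat(file_path)
    content_type, _ = mimetypes.guess_type(file_path)
    return {
        "content": content,  # assumed field name for the extracted text
        "file": {
            "extension": os.path.splitext(file_path)[1].lstrip(".").lower(),
            "content_type": content_type,
            "last_modified": _iso(stat.st_mtime),
            "last_accessed": _iso(stat.st_atime),
            "indexing_date": datetime.now(timezone.utc).isoformat(
                timespec="milliseconds"
            ),
            "filesize": stat.st_size,
        },
    }
```

The result can then be passed in place of jsonDict, for example es.index(index=myIndex, body=build_document(pdf_path, text)), where pdf_path and text are whatever you already have for the file.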

Including filesize? And how do I know the value of path.root, as in the example?

"path" : {
"root" : "4a997482a3826d51751b8e7c01e476c",
"virtual" : "/P_GB27980_20120213.pdf",
"real" : "/Users/madabhuc/Documents/IR/presto/eval/P_GB27980_20120213.pdf"
....
}

You send whatever you want to index. If this data is not needed, don't send it.
I have a hard time understanding what your actual problem is.

When I look at the output in response to a query, I see not just the content but also data such as "path", as shown above. I understand that I have to pass in whatever data I want to see in a query result. What is not clear to me is how I would know the value of "root", for example. Or can I assume that such data is generated automatically by ES?

No data is generated automatically.

The id is, which is why I wanted to clarify.
So how do you know the value of "root"?

Many Thanks.

I don't know.

I can tell you how FSCrawler computes all that but that's another story. It's all FSCrawler internals.
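
If you are building the document yourself and want fields shaped like the "path" object above, one possible approach is sketched below. It is only an assumption for illustration: it hashes the root directory path with MD5 to get a stable identifier, and this thread does not confirm that FSCrawler computes "root" the same way. The function name path_fields and its parameters are hypothetical.

```python
import hashlib
import os


def path_fields(file_path, root_dir):
    # Hypothetical helper: derive "path"-like fields for a file under root_dir.
    real = os.path.abspath(file_path)
    root_abs = os.path.abspath(root_dir)
    # Path of the file relative to the root, written with forward slashes.
    virtual = "/" + os.path.relpath(real, root_abs).replace(os.sep, "/")
    # Assumption: an MD5 hex digest of the root path serves as a stable id.
    root_id = hashlib.md5(root_abs.encode("utf-8")).hexdigest()
    return {"root": root_id, "virtual": virtual, "real": real}
```

The returned dictionary can be added to the document under a "path" key before it is sent to es.index.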
