Retrieve metadata of document in response to queries

I am indexing the text content of a file using
res = es.index(index=myIndex, body=jsonDict), where jsonDict is just the text of a PDF file in a JSON dictionary.
I am able to query the content and retrieve the relevant text.
My question is: how do I index the file contents so that I can also retrieve file info such as:
"file" : {
"extension" : "pdf",
"content_type" : "application/pdf",
"created" : "2019-10-01T15:28:31.000+0000",
"last_modified" : "2019-10-01T15:28:31.000+0000",
"last_accessed" : "2019-10-01T15:38:50.000+0000",
"indexing_date" : "2019-10-01T15:39:08.055+0000",
"filesize" : 877861,

"path" : {
"root" : "4a997482a3826d51751b8e7c01e476c",
"virtual" : "/P_GB27980_20120213.pdf",
"real" : "/Users/madabhuc/Documents/IR/presto/eval/P_GB27980_20120213.pdf"

Do I need to explicitly provide this info when I submit the files to Elasticsearch for indexing?

You need to provide what you want to index, that is, the full document.

The background is that I am trying to index individual pages of a PDF document, which FSCrawler does not currently support. So I am trying to interface with Elasticsearch myself.
It is not clear to me how you are providing the file to ES: are you writing the Tika output to a file and then sending it to ES?

Thanks for your help.

From which language? Java?


In jsonDict you should provide the full JSON content, including the metadata and the extracted text.
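
As a rough illustration of what "the full JSON content" could look like from Python, here is a minimal sketch. The helper name build_file_document is hypothetical, the field names simply mirror the example in the question, and the layout (a "file" object next to a "path" object) is an assumption, not something Elasticsearch requires. Only values that the standard library can read from the file system are filled in; "path.root" and "path.virtual" are left out because nothing in this thread says how they are computed.

```python
import mimetypes
import os
from datetime import datetime, timezone


def _iso(timestamp):
    # Convert a POSIX timestamp to an ISO 8601 string in UTC.
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).isoformat()


def build_file_document(path, content):
    """Return a dict with the extracted text plus basic file metadata.

    Hypothetical helper: field names follow the example in the question.
    """
    stat = os.stat(path)
    # guess_type works from the file name only; it does not inspect the file.
    content_type, _ = mimetypes.guess_type(path)
    return {
        "content": content,  # the text you already extract from the PDF
        "file": {
            "extension": os.path.splitext(path)[1].lstrip(".").lower(),
            "content_type": content_type,
            # A portable creation time is not available everywhere,
            # so "created" is deliberately omitted in this sketch.
            "last_modified": _iso(stat.st_mtime),
            "last_accessed": _iso(stat.st_atime),
            "indexing_date": datetime.now(timezone.utc).isoformat(),
            "filesize": stat.st_size,
        },
        "path": {
            "real": os.path.abspath(path),
        },
    }
```

The resulting dict would then take the place of jsonDict in the call from the question, for example res = es.index(index=myIndex, body=build_file_document(pdf_path, text)), where pdf_path and text are placeholders for your own file path and extracted text.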

Including filesize? How do I know path.root, as in the example?

"path" : {
"root" : "4a997482a3826d51751b8e7c01e476c",
"virtual" : "/P_GB27980_20120213.pdf",
"real" : "/Users/madabhuc/Documents/IR/presto/eval/P_GB27980_20120213.pdf"

You send whatever you want to index. If this data is not needed, don't send it.
I have a hard time understanding what your actual problem is.

When I look at the output in response to a query, I see not just the content but also data such as "path", as shown above. I understand that I have to pass in whatever data I want to see in a query result. It is not clear to me how I would know the value of "root", for example. Or can I assume that such data is generated automatically by ES?

No data is generated automatically.

The id is, which is why I wanted to clarify.
So how do you know the value of "root"?

Many Thanks.

I don't know.

I can tell you how FSCrawler computes all that but that's another story. It's all FSCrawler internals.