Retrieve metadata of document in response to queries

chari · November 21, 2019, 4:47pm

I am indexing the text content of a file using
res = es.index(index=myIndex, body=jsonDict), where jsonDict is just the text of a PDF file in a JSON dictionary.
I am able to query the content and retrieve the relevant text.
My question is: how do I index the file contents so that I can also retrieve file info such as:
"file" : {
"extension" : "pdf",
"content_type" : "application/pdf",
"created" : "2019-10-01T15:28:31.000+0000",
"last_modified" : "2019-10-01T15:28:31.000+0000",
"last_accessed" : "2019-10-01T15:38:50.000+0000",
"indexing_date" : "2019-10-01T15:39:08.055+0000",
"filesize" : 877861,

and
"path" : {
"root" : "4a997482a3826d51751b8e7c01e476c",
"virtual" : "/P_GB27980_20120213.pdf",
"real" : "/Users/madabhuc/Documents/IR/presto/eval/P_GB27980_20120213.pdf"

Do I need to explicitly provide this info when I submit the files to Elasticsearch for indexing?
Thanks.

dadoonet · November 21, 2019, 7:01pm

You need to provide what you want to index. The full document that is.

chari · November 21, 2019, 7:31pm

The background is that I am trying to index the individual pages of a PDF document, which FSCrawler does not currently support. So I am trying to interface with Elasticsearch myself.
It is not clear to me how you provide the file to ES: are you writing the Tika output to a file and then sending it to ES?

Thanks for your help.

dadoonet · November 21, 2019, 7:57pm

From which language? Java?

chari · November 21, 2019, 9:10pm

python

dadoonet · November 22, 2019, 1:26pm

In jsonDict you should provide the full JSON content, including the metadata and the extracted text.
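
As a rough illustration of that, here is a minimal Python sketch that builds such a dictionary from the extracted text plus file metadata read from the filesystem. The helper name `build_document` and the exact set of fields are assumptions chosen to mirror the example in the question; they are not an FSCrawler or Elasticsearch API. The "created" field is left out because `os.stat` does not expose a creation time portably.

```python
import mimetypes
import os
from datetime import datetime, timezone


def _iso(timestamp):
    """Format a POSIX timestamp as an ISO 8601 string in UTC."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).isoformat()


def build_document(path, text):
    """Return a dict holding the extracted text and file metadata.

    Hypothetical helper: field names follow the example in the question,
    but nothing here is required by Elasticsearch.
    """
    st = os.stat(path)
    # guess_type works from the file name only; it may return None.
    content_type, _ = mimetypes.guess_type(path)
    return {
        "content": text,
        "file": {
            "extension": os.path.splitext(path)[1].lstrip(".").lower(),
            "content_type": content_type,
            "last_modified": _iso(st.st_mtime),
            "last_accessed": _iso(st.st_atime),
            "indexing_date": datetime.now(timezone.utc).isoformat(),
            "filesize": st.st_size,
        },
        "path": {
            "real": os.path.abspath(path),
        },
    }
```

The result can then be passed as `jsonDict` to the same `es.index(index=myIndex, body=jsonDict)` call used in the question, and every field in it will come back in query results.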

chari · November 22, 2019, 3:04pm

Including filesize? How do I know path.root as in the example?

"path" : {
"root" : "4a997482a3826d51751b8e7c01e476c",
"virtual" : "/P_GB27980_20120213.pdf",
"real" : "/Users/madabhuc/Documents/IR/presto/eval/P_GB27980_20120213.pdf"
....
}

dadoonet · November 22, 2019, 4:08pm

You send whatever you want to index. If this data is not needed, don't send it.
I have a hard time understanding what your actual problem is.

chari · November 22, 2019, 4:33pm

When I look at the output in response to a query, I see not just the content but also data such as "path", as shown above. I understand that I have to pass in whatever data I want to see in a query result. What is not clear to me is how I would know the value of "root", for example. Or can I assume that such data is generated automatically by ES?

dadoonet · November 22, 2019, 4:52pm

No data is generated automatically.

chari · November 22, 2019, 5:22pm

The id is, which is why I wanted to clarify.
So how do you know the value of "root"?

Many Thanks.

dadoonet · November 22, 2019, 6:39pm

I don't know.

I can tell you how FSCrawler computes all that but that's another story. It's all FSCrawler internals.
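
If you want fields like these in documents you index yourself, you have to compute them in your own code. The sketch below is one possible way in Python. The helper `path_fields`, the `base_dir` parameter, and the choice of an MD5 hash of the parent directory for "root" are all assumptions made for illustration; the thread does not describe how FSCrawler actually derives these values.

```python
import hashlib
import os


def path_fields(real_path, base_dir):
    """Build a "path" sub-document for a file under base_dir.

    Hypothetical scheme, not FSCrawler's:
      root    - MD5 hex digest of the file's parent directory
      virtual - path relative to base_dir, with a leading slash
      real    - absolute path on disk
    """
    real = os.path.abspath(real_path)
    base = os.path.abspath(base_dir)
    parent = os.path.dirname(real)
    # Normalise separators so the virtual path looks the same on any OS.
    relative = os.path.relpath(real, base).replace(os.sep, "/")
    return {
        "root": hashlib.md5(parent.encode("utf-8")).hexdigest(),
        "virtual": "/" + relative,
        "real": real,
    }
```

Any stable value would do for "root"; a hash of the directory simply gives files in the same folder a shared key to filter on.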

