I am indexing the text content of a file using
res = es.index(index=myIndex, body=jsonDict), where jsonDict is just the text of a PDF file wrapped in a JSON dictionary.
I am able to query the content and retrieve the relevant text.
My question is: how do I index the file contents so that I can also retrieve file info such as:
"file" : {
"extension" : "pdf",
"content_type" : "application/pdf",
"created" : "2019-10-01T15:28:31.000+0000",
"last_modified" : "2019-10-01T15:28:31.000+0000",
"last_accessed" : "2019-10-01T15:38:50.000+0000",
"indexing_date" : "2019-10-01T15:39:08.055+0000",
"filesize" : 877861,
The background is that I am trying to index individual pages of a PDF document, which is currently not supported by fscrawler, so I am trying to interface with Elasticsearch myself.
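For context, one way to get a "file" block like the one above is to build it yourself before calling es.index, since the plain elasticsearch-py client indexes exactly the JSON body you give it. A minimal sketch, assuming the field names mirror fscrawler's "file" object and that `es` and `myIndex` are defined as in the question (the hard-coded content type is an assumption for a PDF):

```python
import os
from datetime import datetime, timezone

def file_metadata(path):
    """Build an fscrawler-style 'file' dict from os.stat().

    Field names mirror the fscrawler output shown above; content_type is
    hard-coded as an assumption (the mimetypes module could guess it).
    """
    st = os.stat(path)

    def iso(ts):
        # Convert a POSIX timestamp to an ISO-8601 UTC string
        return datetime.fromtimestamp(ts, tz=timezone.utc).isoformat()

    return {
        "extension": os.path.splitext(path)[1].lstrip("."),
        "content_type": "application/pdf",
        "created": iso(st.st_ctime),       # note: st_ctime semantics are platform-dependent
        "last_modified": iso(st.st_mtime),
        "last_accessed": iso(st.st_atime),
        "indexing_date": datetime.now(timezone.utc).isoformat(),
        "filesize": st.st_size,
    }

# Hypothetical usage: combine the extracted text with the metadata block
# body = {"content": page_text, "file": file_metadata("mydoc.pdf")}
# res = es.index(index=myIndex, body=body)
```

A query on such a document would then return the "file" sub-object alongside the content, just as fscrawler's documents do.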
It is not clear to me how you are providing the file to ES: are you writing the Tika output to a file and then sending it to ES?
When I look at the output in response to a query, I see not just the content but also data such as "path", as seen above. I understand that I have to pass in whatever data I want to be visible in a query result. What is not clear to me is how I would know the value of "root", for example, or whether I can assume that such data is generated automatically by ES.
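For what it's worth, Elasticsearch does not add fields like "root" or "path" to a document body; it stores only the JSON you send (metadata such as _id and _index live outside the source). fscrawler computes its "path" object itself before indexing. A sketch of supplying a similar object by hand, where the field names mirror fscrawler's output but the hashing scheme is purely an assumption for illustration, not fscrawler's documented algorithm:

```python
import hashlib
import os

def path_metadata(real_path, root_dir):
    """Build an fscrawler-style 'path' object by hand.

    Assumption: 'root' here is an MD5 hex digest of the root directory,
    chosen only to illustrate a stable identifier; 'virtual' is the path
    relative to that root, and 'real' is the absolute path.
    """
    return {
        "root": hashlib.md5(root_dir.encode("utf-8")).hexdigest(),
        "virtual": "/" + os.path.relpath(real_path, root_dir),
        "real": real_path,
    }
```

The point is simply that every such field in the query result is whatever the indexer chose to write, so you are free to define your own scheme.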