Getting metadata of the extracted text from a file

AmeyRok · June 22, 2018, 4:15pm

I have converted a PDF file into JSON using the tika and json libraries in Python. When I search for a keyword in my index, I also want to get the page number of the PDF file from which I extracted the data that I index into Elasticsearch. Is it possible? If yes, please tell me what my approach should be.
TIA.

dadoonet · June 23, 2018, 8:19am

Please update the title of your post to something more meaningful.

I also want to get the page number of the PDF file from which I extracted the data that I index into Elasticsearch. Is it possible?

Yes, but only if you know the page number when you index. So it depends on what the documents you are sending to Elasticsearch look like.
I mean that, AFAIK, Tika does not give you the page number back.

For example if you send something like:

PUT book/_doc/foo
{
  "text": "bar .... bla bla ... baz"
}

Then there is no page number and you can't do anything about it.

But if you send

PUT page/_doc/foo_1
{
  "page": 1,
  "text": "bar"
}
PUT page/_doc/foo_2
{
  "page": 2,
  "text": "bla bla"
}
PUT page/_doc/foo_3
{
  "page": 3,
  "text": "baz"
}

Then it's obvious: every hit carries its own page number. So all the hard work has to be done on the extraction side, not on the Elasticsearch side.
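
As a rough sketch of the one-document-per-page approach in Python: assuming you already have the text of each page as a list of strings (how you split the PDF into pages is up to your extraction step and is not shown here), you could build a bulk request body and a search body like this. The function names build_page_actions and build_search_body are made up for this illustration; the index name, IDs, and fields follow the example above.

```python
import json


def build_page_actions(book_id, pages, index="page"):
    """Build an NDJSON body for the bulk API, one document per page.

    `pages` is assumed to be a list of per-page strings produced by
    your own extraction step.
    """
    lines = []
    for number, text in enumerate(pages, start=1):
        # Action line: which index and which ID (e.g. foo_1, foo_2, ...)
        lines.append(json.dumps({"index": {"_index": index, "_id": f"{book_id}_{number}"}}))
        # Source line: the page number travels with the page text
        lines.append(json.dumps({"page": number, "text": text}))
    # The bulk API expects a trailing newline
    return "\n".join(lines) + "\n"


def build_search_body(keyword):
    """Build a search body that matches on the text and returns only the page number."""
    return {"query": {"match": {"text": keyword}}, "_source": ["page"]}


if __name__ == "__main__":
    print(build_page_actions("foo", ["bar", "bla bla", "baz"]))
    print(json.dumps(build_search_body("baz")))
```

You would send the first body to the _bulk endpoint and the second to page/_search, with whatever client you use. Depending on your Elasticsearch version, the bulk action line may also need a type.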

AmeyRok · June 25, 2018, 10:04am

Thank you so much.

system · July 23, 2018, 10:04am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
