Getting metadata of the extracted text from a file


(Amey) #1

I have converted a PDF file into JSON using Tika and the json library in Python. When I search for a keyword in my index, I also want to get the page number of the PDF file from which the matching text was extracted and indexed into Elasticsearch. Is it possible? If yes, please tell me what my approach should be.
TIA.


(David Pilato) #2

Please update the title of your post to something more meaningful.

I also want to have the page number of the PDF file from which I extracted the data to index into Elasticsearch. Is it possible?

Yes, but only if you know it. So it depends on what the documents you send to Elasticsearch look like.
I mean that, AFAIK, Tika does not give you the page number back.

For example if you send something like:

PUT book/_doc/foo
{
  "text": "bar .... bla bla ... baz"
}

Then there is no page number and you can't do anything about it.

But if you send one document per page:

PUT page/_doc/foo_1
{
  "page": 1,
  "text": "bar"
}
PUT page/_doc/foo_2
{
  "page": 2,
  "text": "bla bla"
}
PUT page/_doc/foo_3
{
  "page": 3,
  "text": "baz"
}

Then it's obvious: every hit carries its own page number. So all the hard work has to be done on the extraction side, not on the Elasticsearch side.
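
As a rough illustration of that extraction-side work, here is a minimal Python sketch. It assumes you have already managed to split the extracted text into one string per page (how you get there depends on your extraction tool and is not shown). The helper names build_page_docs and to_bulk_body are hypothetical, and the index name "page" and the id scheme foo_1, foo_2, ... simply mirror the requests above. The output is a newline-delimited body of the kind the _bulk API accepts.

```python
import json


def build_page_docs(doc_id, pages):
    """Return one (id, body) pair per page, with pages numbered from 1.

    `pages` is assumed to be a list of strings, one per PDF page,
    produced by whatever extraction step you use (hypothetical input).
    """
    return [
        (f"{doc_id}_{number}", {"page": number, "text": text})
        for number, text in enumerate(pages, start=1)
    ]


def to_bulk_body(index, docs):
    """Serialize (id, body) pairs as NDJSON: an action line, then the source."""
    lines = []
    for _id, body in docs:
        lines.append(json.dumps({"index": {"_index": index, "_id": _id}}))
        lines.append(json.dumps(body))
    # A bulk request body has to end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical per-page text, matching the three-page example above.
pages = ["bar", "bla bla", "baz"]
docs = build_page_docs("foo", pages)
print(to_bulk_body("page", docs))
```

You would then send that body to the _bulk endpoint with the client of your choice. Because each page is its own document, a keyword search returns the "page" field together with the matching text.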


(Amey) #3

Thank you so much.

