Possible to Index PDFs by page?


(Meltemi) #1

Can elasticsearch index an attachment (PDF specifically) so that the
parent/child relationship between the document (PDF) and its pages is
preserved?

Our requirement dictates that matches should initially return the title of
the PDF where the match occurred. Then, if the user wants to drill down
further, only the actual page where the hit occurred (with highlighting)
should be presented. From there the user should be able to page forward (or
back) to continue reading. We should not return the entire 100+ page
document, only individual pages from within each document. Does anyone
know how to do this with elasticsearch?

--


(Clinton Gormley) #2

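Have a look at this:

http://stackoverflow.com/questions/10854858/best-practices-for-searchable-archive-of-thousands-of-documents-pdf-and-or-xml

clint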

--


(Meltemi) #3

Yeah, that's my post from a few months ago (lingering project, don't
ask)...and I got a very helpful answer on it, but it doesn't answer *this*
question: How to get elasticsearch to index the PDFs and *include* the page
information so we can then use the advice in that post to serve the
individual pages?!?

Do we need to break the PDFs up into individual pages and then feed them
into ES and somehow associate those individual pages back to a parent? Or
is there a way to have ES, when it indexes a whole PDF(parent), add some
kind of page meta-data to the text as it indexes each page(child)? Or is
there a better way to do this?

Thanks for any & all advice!

On Wednesday, August 29, 2012 10:41:56 AM UTC-7, Clinton Gormley wrote:

Have a look at this:

http://stackoverflow.com/questions/10854858/best-practices-for-searchable-archive-of-thousands-of-documents-pdf-and-or-xml

clint

--


(phill) #4

I would like the same information, and was wondering whether Lucene payloads
could somehow be leveraged (though those are a long way away when using ES).

Here are a few problems with indexing one page per document. If a sentence
continues onto the next page, a phrase that spans the page break won't be
matched. Another question: is the combined score of all pages, for all
terms, equivalent to the score of the whole document?

Recall that:

idf = inverse document frequency, a formula based on the number of
documents (not pages). It is trying to score rare vs. common words, though,
so maybe it all works out.

tf = term frequency in a document (not in a page).

I don't know the answer to these questions.

-Paul
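
To make the two concerns above concrete, here is a small sketch in plain Python (no Elasticsearch involved). It uses the textbook form idf = log(N / df), which is an assumption for illustration and not Lucene's exact scoring formula. The two-PDF corpus and all names are made up.

```python
import math

# Hypothetical corpus: each PDF is a list of page texts.
pdfs = {
    "manual.pdf": ["the quick brown", "fox jumps over the lazy dog"],
    "notes.pdf": ["the dog sleeps", "nothing else here"],
}

def idf(term, units):
    """Textbook idf = log(N / df); a 'unit' is whatever was indexed as one document."""
    df = sum(1 for text in units if term in text.split())
    return math.log(len(units) / df) if df else 0.0

def phrase_hits(phrase, units):
    """Return the units that contain the exact phrase as a substring."""
    return [text for text in units if phrase in text]

# Index granularity 1: one unit per whole PDF.
whole_docs = [" ".join(pages) for pages in pdfs.values()]
# Index granularity 2: one unit per page.
single_pages = [page for pages in pdfs.values() for page in pages]

idf_docs = idf("dog", whole_docs)      # "dog" occurs in 2 of 2 PDFs
idf_pages = idf("dog", single_pages)   # "dog" occurs in 2 of 4 pages

# "brown fox" straddles the page break in manual.pdf.
cross_page_whole = phrase_hits("brown fox", whole_docs)
cross_page_pages = phrase_hits("brown fox", single_pages)
```

In this toy corpus the same term gets a different idf depending on whether pages or whole PDFs are the indexed unit, and the phrase that straddles the page break is only found when the whole PDF is indexed as one unit.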

--


(Clinton Gormley) #5

Hi Meltemi

On Wed, 2012-08-29 at 11:28 -0700, Meltemi wrote:

Yeah, that's my post from a few months ago (lingering project, don't
ask)...and I got a very helpful answer on it but it doesn't answer
this question: How to get elasticsearch to index the PDFs and include
the page information so we can then use the advice in that post to
serve the individual pages?!?

Do we need to break the PDFs up into individual pages and then feed
them into ES and somehow associate those individual pages back to a
parent?

Yes, you need to do what you describe above.

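Reread the answer I gave on

http://stackoverflow.com/questions/10854858/best-practices-for-searchable-archive-of-thousands-of-documents-pdf-and-or-xml

starting from "First the indexing part: storing your docs in
ElasticSearch:"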

I give a step-by-step guide explaining how to do it.

If this doesn't answer your question, then I'm missing the bit you don't
understand.

clint
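
As a rough illustration of "break the PDFs up into individual pages and associate them back to a parent", the sketch below builds one search document per page and the request bodies for the two steps of the original requirement. It is not necessarily the approach in the linked answer. It assumes the page text has already been extracted by some other tool, that each page simply carries its parent's id and title (a flat layout rather than a parent/child mapping), and that `pdf_id` is mapped as an exact-value (not analyzed) field. The index name and all field names are hypothetical. The bulk action line follows the format of recent Elasticsearch versions; older versions also expect a `_type`.

```python
import json

def page_docs(pdf_id, title, pages):
    """Yield one document per page, each carrying its parent PDF's id and title."""
    for number, text in enumerate(pages, start=1):
        yield {
            "_id": f"{pdf_id}-{number}",  # predictable id: parent id + page number
            "pdf_id": pdf_id,             # hypothetical field linking back to the parent
            "title": title,
            "page": number,
            "text": text,
        }

def bulk_body(index, docs):
    """Serialise documents as the newline-delimited JSON the bulk API expects."""
    lines = []
    for doc in docs:
        source = dict(doc)
        doc_id = source.pop("_id")
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline

def search_query(words):
    """Step 1: match page text, highlight hits, return title and page number only."""
    return {
        "query": {"match": {"text": words}},
        "highlight": {"fields": {"text": {}}},
        "_source": ["pdf_id", "title", "page"],
    }

def page_query(pdf_id, page):
    """Step 2: fetch one specific page of one PDF (use page + 1 for 'next page')."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"pdf_id": pdf_id}},
                    {"term": {"page": page}},
                ]
            }
        }
    }

# Made-up sample data standing in for text extracted from a real PDF.
pages = ["Introduction to sailing.", "Reefing the mainsail in heavy wind."]
docs = list(page_docs("pdf-42", "Sailing Handbook", pages))
body = bulk_body("pdf_pages", docs)
```

Because the document ids here are built from the parent id and the page number, paging forward or back could also be done with a plain get-by-id instead of `page_query`.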

--


(Santosh B) #6

Hi,
So what design approach did you follow?
I am thinking of storing the contents of the PDF and indexing them in
ElasticSearch, and storing the link to the file in the filesystem/S3 or
some NoSQL store.
When querying ElasticSearch, I would use term vectors to get the position
offsets and then extract the contents from the file system (maybe with some
extra bytes before and after the offset).
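
A minimal sketch of the "extra bytes before and after the offset" step, assuming the offsets come from something like the `start_offset` / `end_offset` values that term vectors report when offsets are stored. Those are offsets into the indexed text rather than byte positions in the original PDF, so this sketch slices the extracted text, which is assumed to be stored exactly as it was indexed. The function name and sample text are hypothetical.

```python
def context_window(text, start_offset, end_offset, padding=40):
    """Return the matched span plus up to `padding` characters on each side."""
    lo = max(0, start_offset - padding)          # clamp at the start of the text
    hi = min(len(text), end_offset + padding)    # clamp at the end of the text
    return text[lo:hi]

# Made-up stored text; the offsets are computed here only to keep the example self-contained.
stored_text = "Reef early: reefing the mainsail keeps the boat upright in a gust."
start = stored_text.index("mainsail")
end = start + len("mainsail")
snippet = context_window(stored_text, start, end, padding=12)
```

Clamping both ends keeps the slice valid when a hit sits near the beginning or end of the stored text.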

--

