Can elasticsearch index an attachment (PDF specifically) so that the
parent/child relationship between the document (PDF) and its pages is
preserved?
Our requirement dictates that matches should initially return the title of
the PDF where the match occurred. Then, if the user wants to drill down
further, only the actual page where the hit occurred (with highlighting)
should be presented. From there the user should be able to page forward (or
back) to continue reading. We should not return the entire 100+ page
document, only individual pages from within each document. Does anyone
know how to do this with elasticsearch?
Yeah, that's my post from a few months ago (lingering project, don't
ask)...and I got a very helpful answer on it but it doesn't answer this
question: How to get elasticsearch to index the PDFs and include the
page information so we can then use the advice in that post to serve the
individual pages?!?
Do we need to break the PDFs up into individual pages and then feed them
into ES and somehow associate those individual pages back to a parent? Or
is there a way to have ES, when it indexes a whole PDF(parent), add some
kind of page meta-data to the text as it indexes each page(child)? Or is
there a better way to do this?
Thanks for any & all advice!
On Wednesday, August 29, 2012 10:41:56 AM UTC-7, Clinton Gormley wrote:
On Wed, Aug 29, 2012 at 7:13 PM, Meltemi <mdeme...@gmail.com> wrote:
[original question, as quoted at the top of the thread]
I would like the same information and was wondering if Lucene payloads
could somehow be leveraged (but those are a long way away when using ES).
Here are a few problems with indexing one page per document. If a
sentence continues on the next page, a phrase that spans the page break
won't be matched.
Another question: is a combined score of all pages for all terms
equivalent to the score of the whole document?
Recall that
idf = inverse document frequency, a formula based on the number of
documents (not pages); but it is trying to score rare vs. common
words, so maybe it all works out,
and
tf = term frequency in a document (not in a page).
I don't know the answer to these questions.
-Paul
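To make the first problem concrete, here is a small sketch in plain Python (no Elasticsearch involved) showing a phrase that straddles a page break being missed when pages are searched separately. The overlap workaround, where each page also carries the first few words of the next page, is only an assumption about how the problem could be mitigated and is not something proposed in this thread; the function names are made up for the example.

```python
from typing import List


def contains_phrase(text: str, phrase: str) -> bool:
    """Naive phrase match on whitespace-separated, lowercased tokens."""
    tokens = text.lower().split()
    wanted = phrase.lower().split()
    return any(tokens[i:i + len(wanted)] == wanted
               for i in range(len(tokens) - len(wanted) + 1))


def with_overlap(pages: List[str], words: int) -> List[str]:
    """Append the first `words` tokens of the following page to each page."""
    padded = []
    for i, page in enumerate(pages):
        tail = pages[i + 1].split()[:words] if i + 1 < len(pages) else []
        padded.append(" ".join(page.split() + tail))
    return padded


pages = [
    "the index stores each page as a separate",
    "document so phrases can be cut in half",
]

# Searching page by page: the phrase crosses the break, so no page matches.
print([contains_phrase(p, "separate document") for p in pages])

# With a three-word overlap the first page carries enough text to match.
print([contains_phrase(p, "separate document") for p in with_overlap(pages, 3)])
```

The trade-off is that overlapped words are counted on two pages, which skews term frequencies slightly, and a highlight may land in text that really belongs to the following page.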
On 8/29/2012 11:28 AM, Meltemi wrote:
Yeah, that's my post from a few months ago (lingering project, don't
ask)...and I got a very helpful answer on it but it doesn't answer
this question: How to get elasticsearch to index the PDFs and include
the page information so we can then use the advice in that post to
serve the individual pages?!?
Do we need to break the PDFs up into individual pages and then feed
them into ES and somehow associate those individual pages back to a
parent?
Yes, you need to do what you describe above.
Reread the answer I gave on
starting from "First the indexing part: storing your docs in
Elasticsearch:"
I give a step-by-step guide explaining how to do it.
If this doesn't answer your question, then I'm missing the bit you don't
understand.
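The splitting step confirmed above can be sketched as follows. This is only a minimal illustration, not the step-by-step guide from the referenced answer: it assumes the page texts have already been extracted from the PDF by some other tool, and the field names (`doc_id`, `title`, `page_number`, `page_count`, `text`) and the `doc_id:page` id scheme are hypothetical choices made for this example.

```python
from typing import Dict, Iterator, List, Optional


def page_documents(doc_id: str, title: str, pages: List[str]) -> Iterator[Dict]:
    """Yield one search document per PDF page, each pointing back to its parent.

    All field names here are hypothetical; adapt them to your own mapping.
    """
    for number, text in enumerate(pages, start=1):
        yield {
            "_id": f"{doc_id}:{number}",  # deterministic id, so neighbours are easy to fetch
            "doc_id": doc_id,             # link back to the parent PDF
            "title": title,               # repeated on each page so a hit can show the PDF title
            "page_number": number,
            "page_count": len(pages),
            "text": text,
        }


def neighbour_id(page: Dict, step: int) -> Optional[str]:
    """Return the id of the page `step` pages away, or None past either end."""
    target = page["page_number"] + step
    if 1 <= target <= page["page_count"]:
        return f"{page['doc_id']}:{target}"
    return None


if __name__ == "__main__":
    docs = list(page_documents("manual-42", "User Manual", ["intro", "setup", "usage"]))
    print(docs[1]["_id"], neighbour_id(docs[1], +1), neighbour_id(docs[2], +1))
```

Each of these dictionaries would then be indexed as its own document. Grouping hits by `doc_id` gives the title-level result list, and the page id gives both the drill-down target and the page forward/back navigation.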
Hi,
So what design approach did you follow?
I am thinking of indexing the contents of the PDF in Elasticsearch and
storing a link to the file itself, which would live on a filesystem, S3, or
some NoSQL store.
When querying Elasticsearch, I would use term vectors to get the position
offsets and then extract the contents from the file system (maybe with some
extra bytes before and after the offset).
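The offset-based extraction described above might look roughly like this. It is a sketch under assumptions: the offsets are taken to be character offsets into the extracted plain text (the same text that was indexed), not byte offsets into the PDF file itself, and `start`, `end`, and `padding` are illustrative parameters rather than anything Elasticsearch returns in this exact shape.

```python
def snippet(text: str, start: int, end: int, padding: int = 40) -> str:
    """Return text[start:end] plus up to `padding` characters of context on each side."""
    if not 0 <= start <= end <= len(text):
        raise ValueError("offsets fall outside the stored text")
    left = max(0, start - padding)
    right = min(len(text), end + padding)
    return text[left:right]


stored_text = "Chapter 3. The quick brown fox jumps over the lazy dog near the river bank."

# Hypothetical offsets for the term "fox", as a term-vector lookup might report them.
start = stored_text.index("fox")
end = start + len("fox")

print(snippet(stored_text, start, end, padding=10))
```

If that assumption holds, the extracted text would have to be stored alongside the original PDF, since the offsets would not line up with the binary file.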