I would like the same information and was wondering if Lucene payloads
could somehow be leveraged (but those are a long way away when using ES).
Here are a few problems with indexing one page per document: if a
sentence continues on the next page, a phrase query won't match across
the page break.
Another question: is the combined score of all pages, over all terms,
equivalent to scoring the whole document?
idf = inverse document frequency, a formula based on the number of
documents (not pages); it is meant to weight rare words over common
ones, so maybe it all works out.
tf = term frequency in a document (not in a page)
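A toy sketch of the concern above: the same term gets a different idf
depending on whether the corpus units are whole documents or individual
pages, so page-level indexing can shift relevance scores. The corpus and
helper here are made up purely for illustration.

```python
import math

# Toy corpus: 2 documents, each already split into pages.
docs = [
    ["the cat sat", "the cat ran"],  # doc 0: 2 pages
    ["dogs bark loudly"],            # doc 1: 1 page
]

def idf(term, units):
    """Classic idf = log(N / df) over a list of text units
    (whole documents or single pages)."""
    df = sum(1 for u in units if term in u.split())
    return math.log(len(units) / df) if df else 0.0

whole_docs = [" ".join(pages) for pages in docs]
pages = [p for d in docs for p in d]

# "cat" appears in 1 of 2 documents, but in 2 of 3 pages,
# so the two idf values differ.
print(idf("cat", whole_docs))  # idf over documents
print(idf("cat", pages))       # idf over pages
```

With this toy data the document-level idf is log(2) while the
page-level idf is log(3/2), which is the kind of drift the question is
about.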
I don't know the answer to these questions.
On 8/29/2012 11:28 AM, Meltemi wrote:
Yeah, that's my post from a few months ago (lingering project, don't
ask)...and I got a /very/ helpful answer on it /but/ it doesn't answer
/this/ question: how do we get elasticsearch to index the PDFs and
/include/ the page information, so we can then use the advice in that
post to serve the individual pages?
Do we need to break the PDFs up into individual pages and /then/ feed
them into ES, and somehow associate those individual pages back to a
parent? Or is there a way to have ES, when it indexes a whole
PDF (parent), add some kind of page metadata to the text as it indexes
each page (child)? Or is there a better way to do this?
Thanks for any & all advice!