I've indexed PDF documents page by page into Elasticsearch.
Each JSON document represents one page. Whenever a user searches for something, I return the top-scoring page to them.
But I'm not sure how to handle page breaks. For example, a page may start with "and", or it may end in the middle of a sentence. If I return such a page to the user, it's not going to be meaningful on its own. Is there any workaround for this scenario?
You could index some of the previous and next pages' text in addition to the current page's content; that way you'd have some surrounding context to search or highlight.
You could index some kind of "document ID" that is shared amongst all pages, then fetch those after finding the top page. That'd let you provide the entire document to the user instead of just the top page.
You might be able to organize the same thing into a parent/child relationship: some simple metadata about the document as the parent, and all the pages as children. That'd allow you to pull out the entire document, etc. I'd try to avoid this though, since parent/child can be difficult to use.
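To make the first two suggestions a bit more concrete, here is a rough sketch in Python that only builds the JSON bodies and does not talk to a cluster. The field names (`doc_id`, `page_number`, `content`, `prev_tail`, `next_head`), the three-line context window and the boost on `content` are all made up for illustration, so adjust them to your own mapping. It also assumes `doc_id` is mapped as a `keyword` field so that the `term` query matches exactly.

```python
from typing import Dict, List


def build_page_docs(doc_id: str, pages: List[str], context_lines: int = 3) -> List[Dict]:
    """Build one JSON document per page, plus a little text from its neighbours.

    All field names here are hypothetical, not an Elasticsearch convention.
    """
    docs = []
    for i, text in enumerate(pages):
        prev_lines = pages[i - 1].splitlines() if i > 0 else []
        next_lines = pages[i + 1].splitlines() if i + 1 < len(pages) else []
        docs.append({
            "doc_id": doc_id,          # shared by every page of the same PDF
            "page_number": i + 1,
            "content": text,
            # Last few lines of the previous page, first few of the next one.
            "prev_tail": "\n".join(prev_lines[-context_lines:]) if context_lines > 0 else "",
            "next_head": "\n".join(next_lines[:context_lines]),
        })
    return docs


def build_page_search(user_query: str) -> Dict:
    """Search the page itself (boosted) and its surrounding context, with highlighting."""
    return {
        "size": 1,
        "query": {
            "multi_match": {
                "query": user_query,
                "fields": ["content^3", "prev_tail", "next_head"],
            }
        },
        "highlight": {"fields": {"content": {}, "prev_tail": {}, "next_head": {}}},
    }


def build_document_fetch(doc_id: str) -> Dict:
    """Fetch every page of the document that the top hit belongs to, in page order."""
    return {
        "size": 1000,  # arbitrary upper bound on pages per document
        "query": {"term": {"doc_id": doc_id}},
        "sort": [{"page_number": "asc"}],
    }
```

With something like this, a page that starts with "and" can be shown together with its `prev_tail`, so the sentence that was cut by the page break reads as a whole.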
We want to give the user as little data as possible, so we restricted results to pages. Once we have the top page, we apply information retrieval algorithms on top of it to extract the relevant paragraphs and return those to the user. Though it's highly difficult, we are giving it our best.
I got your points 2 and 3, but not point 1. Let's say we index the last few lines of one page and the first few lines of the next as one document. How am I going to use that? Please suggest.
We have already tried point 2. Thanks for that anyway.
Regarding point 3, can we use a parent/child relationship for pages? For example, a page would be the parent and the paragraphs inside the page would be its children. Whenever a paragraph matches, I can return the parent, which is the page. Will this work? The point here is to make the documents smaller in order to increase relevancy, as our present documents have 30-40 lines per page.
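For reference, the page-as-parent, paragraph-as-child idea could look roughly like the sketch below. It assumes an Elasticsearch version that has the `join` field type and the `has_child` query; the field name `relation`, the ID scheme and the blank-line paragraph splitting are all assumptions for illustration (text extracted from a PDF may not keep blank lines between paragraphs). Again, this only builds the request bodies in Python.

```python
import re
from typing import Dict, List


def build_mapping() -> Dict:
    """Index mapping with a join field: page is the parent, paragraph the child."""
    return {
        "mappings": {
            "properties": {
                "content": {"type": "text"},
                "relation": {"type": "join", "relations": {"page": "paragraph"}},
            }
        }
    }


def split_paragraphs(page_text: str) -> List[str]:
    """Naive split on blank lines (an assumption about the extracted text)."""
    return [p.strip() for p in re.split(r"\n\s*\n", page_text) if p.strip()]


def build_page_actions(page_id: str, page_text: str) -> List[Dict]:
    """One parent document for the page, then one child per paragraph.

    Children must be routed to the parent's shard, so every action
    carries the page ID as its routing value.
    """
    actions = [{
        "id": page_id,
        "routing": page_id,
        "source": {"content": page_text, "relation": "page"},
    }]
    for n, paragraph in enumerate(split_paragraphs(page_text), start=1):
        actions.append({
            "id": f"{page_id}_para{n}",  # hypothetical ID scheme
            "routing": page_id,
            "source": {
                "content": paragraph,
                "relation": {"name": "paragraph", "parent": page_id},
            },
        })
    return actions


def build_has_child_search(user_query: str) -> Dict:
    """Return pages whose paragraphs match, scored by their best paragraph."""
    return {
        "query": {
            "has_child": {
                "type": "paragraph",
                "query": {"match": {"content": user_query}},
                "score_mode": "max",
                "inner_hits": {},  # also return the matching paragraphs
            }
        }
    }
```

The `inner_hits` part hands back the matching paragraphs along with the page, which may save a separate extraction step. Note that a paragraph split across a page break is still split here, so this does not remove the need for the surrounding-context trick.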