Handling Page breaks in Elasticsearch

(Rahul Nama) #1

Hi team

I've indexed PDF documents page by page into Elasticsearch.

Each JSON document represents one page. Whenever a user searches for something, I return the top-scoring page to them.

But I'm not sure how to handle the page breaks. For example, a page may start with "and", or it may end in the middle of a sentence. If I return such a page to the user, it's not going to be meaningful. Is there any workaround for this scenario?

Happy to try out any suggestions.

Thanks :slight_smile:


(Zachary Tong) #2

Couple ideas:

  • You could index some of the previous and next pages' text in addition to the current page's content, that way you'd have some surrounding context to search or highlight.
  • You could index some kind of "document ID" that is shared amongst all pages, then fetch those after finding the top page. That'd let you provide the entire document to the user instead of just the top page.
  • You might be able to organize the same thing into a parent/child relationship: some simple metadata about the document as the parent, and all the pages as children. That'd allow you to pull out the entire document, etc. I'd try to avoid this though, since parent/child can be difficult to use.
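
A minimal sketch of the first two ideas, in plain Python with no Elasticsearch client involved. The field names (`doc_id`, `page`, `content`, `prev_tail`, `next_head`), the 200-character context size, and all helper names are assumptions made for illustration. Each page document carries the end of the previous page and the start of the next one, plus a shared `doc_id` so the sibling pages can be fetched after the top page is found.

```python
from typing import Dict, List

CONTEXT_CHARS = 200  # assumed amount of borrowed context; tune for your data


def build_page_docs(doc_id: str, pages: List[str]) -> List[Dict]:
    """Turn a list of page texts into one JSON-ready dict per page."""
    docs = []
    for i, text in enumerate(pages):
        # Tail of the previous page and head of the next page (empty at the edges).
        prev_tail = pages[i - 1][-CONTEXT_CHARS:] if i > 0 else ""
        next_head = pages[i + 1][:CONTEXT_CHARS] if i + 1 < len(pages) else ""
        docs.append({
            "doc_id": doc_id,      # shared by every page of the same PDF
            "page": i + 1,         # 1-based page number
            "content": text,
            "prev_tail": prev_tail,
            "next_head": next_head,
        })
    return docs


def search_body(user_query: str) -> Dict:
    """Search the page text first, with the borrowed context as a weaker signal."""
    return {
        "size": 1,
        "query": {
            "multi_match": {
                "query": user_query,
                "fields": ["content^3", "prev_tail", "next_head"],
            }
        },
    }


def sibling_pages_body(doc_id: str, max_pages: int = 1000) -> Dict:
    """Fetch every page of one PDF, assuming doc_id is mapped as a keyword field."""
    return {
        "size": max_pages,
        "query": {"term": {"doc_id": doc_id}},
        "sort": [{"page": "asc"}],
    }


def stitched_text(doc: Dict) -> str:
    """Text to show the user: the page plus its surrounding context."""
    parts = [doc["prev_tail"], doc["content"], doc["next_head"]]
    return " ".join(p for p in parts if p)
```

The dicts returned by `search_body` and `sibling_pages_body` are request bodies for the search API. A page that starts with "and" can then be displayed through `stitched_text`, so the sentence it continues is visible as well.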

(Rahul Nama) #3

Hey @polyfractal

First of all, thanks for your time. :slight_smile:

We want to give the user as little data as possible, so we restricted results to pages. Once the top page comes back, we apply information retrieval algorithms on top of it to extract the paragraphs and return them to the user. Though it's highly difficult, we are giving it our best.

I understood your points 2 and 3, but not point 1. Let's say we index the last few lines of one page and the first few lines of the next as one document. How would I use that? Please suggest.

We have already tried point 2. Thanks for that anyway.

Regarding point 3, can we use a parent/child relationship for pages? For example, a page would be the parent and the paragraphs inside the page would be the children. Whenever a paragraph matches, I can return the parent, which is the page. Will this work? The point here is to make the documents smaller in order to improve relevancy, as our present documents have 30-40 lines per page.
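
A rough sketch of that page/paragraph layout, written as plain Python dicts around the `join` field type and a `has_child` query. The field names (`relation`, `content`), the ID scheme, the helper names, and the blank-line paragraph split are all assumptions for illustration; real PDF text may need a different splitting rule.

```python
import re
from typing import Dict, List

# Index mapping: "page" is the parent relation, "paragraph" is the child.
MAPPING = {
    "mappings": {
        "properties": {
            "content": {"type": "text"},
            "relation": {"type": "join", "relations": {"page": "paragraph"}},
        }
    }
}


def split_paragraphs(page_text: str) -> List[str]:
    """Split on blank lines (an assumed paragraph boundary)."""
    return [p.strip() for p in re.split(r"\n\s*\n", page_text) if p.strip()]


def page_and_paragraph_docs(page_id: str, page_text: str) -> List[Dict]:
    """Build one parent record for the page and one child record per paragraph.

    Children must be indexed with routing set to the parent's ID so that
    parent and children end up on the same shard.
    """
    records = [{
        "id": page_id,
        "routing": page_id,
        "source": {"content": page_text, "relation": {"name": "page"}},
    }]
    for n, para in enumerate(split_paragraphs(page_text), start=1):
        records.append({
            "id": f"{page_id}-para{n}",
            "routing": page_id,
            "source": {
                "content": para,
                "relation": {"name": "paragraph", "parent": page_id},
            },
        })
    return records


def pages_matching_paragraph(user_query: str) -> Dict:
    """Return parent pages, scored by their best-matching child paragraph."""
    return {
        "query": {
            "has_child": {
                "type": "paragraph",
                "score_mode": "max",
                "query": {"match": {"content": user_query}},
            }
        }
    }
```

With `score_mode` set to `max`, a page ranks by its single best paragraph, which matches the goal of scoring against small units of text while still returning the page.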

Please suggest, if possible

Thanks again as always :grinning:
