Handling Page breaks in Elasticsearch

rahulnama · October 31, 2018, 11:00am

Hi team

I've indexed PDF documents page by page into Elasticsearch.

Each JSON document represents one page. Whenever a user searches for something, I return the top-matching page.

But I'm not sure how to handle page breaks. For example, a page may start with "and", or may not end on a complete sentence. If I return such a page to the user, it won't be meaningful. Is there any workaround for this scenario?

Happy to try out any suggestions.

Thanks

-Rahul

polyfractal · October 31, 2018, 8:04pm

A couple of ideas:

1. You could index some of the previous and next pages' content in addition to the current page's content. That way you'd have some surrounding context to search or highlight.
2. You could index some kind of "document ID" that is shared amongst all pages, then fetch those pages after finding the top page. That'd let you provide the entire document to the user instead of just the top page.
3. You might be able to organize the same thing into a parent/child relationship: some simple metadata about the document as the parent, and all the pages as children. That'd allow you to pull out the entire document, etc. I'd try to avoid this though, since parent/child can be difficult to use.
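To make ideas 1 and 2 a bit more concrete, here is a minimal sketch of how the page documents could be built before indexing. The field names (`doc_id`, `page`, `content`, `prev_context`, `next_context`) and the 200-character overlap are assumptions made up for illustration, not anything Elasticsearch requires.

```python
from typing import Dict, List


def build_page_docs(doc_id: str, pages: List[str], overlap_chars: int = 200) -> List[Dict]:
    """Build one JSON-like document per page, with some neighbouring text attached.

    All field names here are hypothetical and chosen only for this sketch.
    """
    docs = []
    for i, text in enumerate(pages):
        # Tail of the previous page (empty for the first page or a zero overlap).
        prev_tail = pages[i - 1][-overlap_chars:] if i > 0 and overlap_chars > 0 else ""
        # Head of the next page (empty for the last page).
        next_head = pages[i + 1][:overlap_chars] if i + 1 < len(pages) else ""
        docs.append({
            "doc_id": doc_id,        # shared by every page of the same PDF (idea 2)
            "page": i + 1,           # 1-based page number
            "content": text,
            "prev_context": prev_tail,
            "next_context": next_head,
        })
    return docs


def sibling_pages_query(doc_id: str) -> Dict:
    """Search body that fetches all pages of one PDF in page order (idea 2).

    Assumes `doc_id` is mapped as a keyword field and `page` as a number.
    """
    return {
        "query": {"term": {"doc_id": doc_id}},
        "sort": [{"page": "asc"}],
    }
```

Each dict returned by `build_page_docs` would be indexed as its own document. After the top page is found, its `doc_id` can be passed to `sibling_pages_query` to pull the rest of the PDF.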

rahulnama · November 1, 2018, 1:20am

Hey @polyfractal

First of all, thanks for your time.

We want to give the user as little data as possible, so we restricted results to pages. Once we have the top page, we apply information retrieval algorithms on top of that to extract the relevant paragraphs and return them to the user. Though it’s highly difficult, we are giving it our best.

I got your points 2 and 3, but not point 1. Let’s say we index the last few lines of one page and the first few lines of the next as one document. How would I use that? Please suggest.
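One possible way to use such context is sketched below, assuming each page document stores the neighbouring text in separate fields. The field names `content`, `prev_context` and `next_context` and the boost of 3 are hypothetical. The search still scores mainly on the page's own text, while the context fields let a sentence that straddles a page break match and be highlighted.

```python
from typing import Dict


def page_search_body(user_query: str) -> Dict:
    """Search body that matches the page text plus its neighbouring context."""
    return {
        "query": {
            "multi_match": {
                "query": user_query,
                # The page's own text counts three times as much as the context.
                "fields": ["content^3", "prev_context", "next_context"],
                "type": "most_fields",
            }
        },
        # Ask for highlighted snippets from all three fields.
        "highlight": {
            "fields": {"content": {}, "prev_context": {}, "next_context": {}}
        },
        "size": 1,  # only the top page
    }


def display_text(source: Dict) -> str:
    """Stitch the stored context around the page so it reads across the break."""
    parts = [source.get("prev_context", ""), source.get("content", ""), source.get("next_context", "")]
    return " ".join(p for p in parts if p)
```

The paragraph extraction step could then run on `display_text(hit["_source"])` instead of the bare page, so a paragraph cut by the page break is seen whole.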

We have already tried point 2. Thanks for that anyway.

Regarding point 3, can we use a parent/child relationship for pages? For example, a page would be the parent and the paragraphs inside the page would be its children. Whenever a paragraph matches, I can return the parent, which is the page. Will this work? The point here is to make the documents smaller to increase relevancy, as our present documents have 30-40 lines per page.
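For reference, a rough sketch of what this page/paragraph layout could look like with the `join` field type. The field names `page_paragraph` and `text` are made up for illustration, and the typeless mapping shown assumes Elasticsearch 7 or later (6.x would need a mapping type name around `properties`).

```python
from typing import Dict


def join_mapping() -> Dict:
    """Index mapping with a join field: 'page' is the parent of 'paragraph'."""
    return {
        "mappings": {
            "properties": {
                "page_paragraph": {"type": "join", "relations": {"page": "paragraph"}},
                "text": {"type": "text"},
            }
        }
    }


def paragraph_doc(page_id: str, text: str) -> Dict:
    """Child document body. It must be indexed with routing set to page_id
    so that it lands on the same shard as its parent page."""
    return {
        "text": text,
        "page_paragraph": {"name": "paragraph", "parent": page_id},
    }


def pages_with_matching_paragraph(user_query: str) -> Dict:
    """Return parent pages whose child paragraphs match the query."""
    return {
        "query": {
            "has_child": {
                "type": "paragraph",
                "query": {"match": {"text": user_query}},
                "score_mode": "max",   # rank a page by its best paragraph
                "inner_hits": {},      # also return the paragraphs that matched
            }
        }
    }
```

With `inner_hits`, the response carries both the page and the matching paragraphs, which may cover the paragraph extraction step as well.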

Please suggest, if possible

Thanks again as always :)

system · November 29, 2018, 1:24am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
