Best approach to index book content

There are more requirements, but the most important ones are the following:

  • I should be able to run a search query that looks at ALL pages of a single book and gives me a score for the whole book (all pages).
  • I should know which pages of the book matched the term I searched for, and which page matched best.
  • A book can contain a large number of pages, up to 500.
  • Not a requirement, but worth mentioning: I will never need Elasticsearch to give me ALL pages, just those that matched.

My initial approach was to store a document with the following shape:

{
   "title":"Lord of the rings",
   "publishYear":1954,
   "author":"J. R. R. Tolkien",
   "genre":[
      "fantasy",
      "drama"
   ],
   "pages":[
      {
         "pageNumber":1,
         "content":"Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum....."
      },
      {
         "pageNumber":2,
         "content":"Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum......"
      },
      {
         "pageNumber":488,
         "content":"Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum....."
      }
   ]
}

This would be a single index, and the pages property would be a nested type, of course.

Would this be the best approach?

  • Can I find out which pages match the term I'm looking for, and which page matched best? (I think inner hits does the job.)
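
For reference, the nested variant being asked about could be sketched roughly as follows. The request bodies are written as plain Python dicts that mirror the JSON. The field types, the helper name `nested_book_query`, and the choice of `score_mode` are assumptions for illustration only. On a `nested` query, `score_mode` controls how the matching pages contribute to the book's score, and `inner_hits` returns the matching pages with their own scores, best first by default.

```python
# Assumed mapping: "pages" is a nested field, so each page is matched on its own.
BOOK_MAPPING = {
    "mappings": {
        "properties": {
            "title": {"type": "text"},
            "publishYear": {"type": "integer"},
            "author": {"type": "text"},
            "genre": {"type": "keyword"},
            "pages": {
                "type": "nested",
                "properties": {
                    "pageNumber": {"type": "integer"},
                    "content": {"type": "text"},
                },
            },
        }
    }
}


def nested_book_query(term, max_pages=5):
    """Build a search body that scores each book from its matching pages."""
    return {
        # Leave the (up to 500) pages out of the returned book source.
        "_source": {"excludes": ["pages"]},
        "query": {
            "nested": {
                "path": "pages",
                "query": {"match": {"pages.content": term}},
                # Book score = sum of the scores of its matching pages.
                "score_mode": "sum",
                # Return only the best matching pages for each book.
                "inner_hits": {"size": max_pages},
            }
        },
    }
```

The dict returned by `nested_book_query("ring")` can be serialized with `json.dumps` and sent as the body of a `_search` request. Using `"max"` or `"avg"` instead of `"sum"` changes how much a book with many matching pages is favored.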

I'm worried about how Elasticsearch would behave when storing such large documents. I've read that nested documents are indexed separately, but that it then performs joins when you fetch the data, and these joins might hurt performance.

Another approach would be to store single pages, which gets rid of nested documents. But then how can I run a search for a book that considers all its pages as a whole rather than individually (scoring all pages/documents of the same book)? Also, the page-index approach would make some filters impossible, like filtering by genre, because it would return all pages of the same book as results, and aggregations would be a mess.

Welcome!

If you are searching for pages, index pages. Not books.

Like:

PUT /pages/_doc/isbn_1
{
   "title":"Lord of the rings",
   "publishYear":1954,
   "author":"J. R. R. Tolkien",
   "genre":[
      "fantasy",
      "drama"
   ],
   "pageNumber":1,
   "content":"Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum....."
}
PUT /pages/_doc/isbn_2
{
   "title":"Lord of the rings",
   "publishYear":1954,
   "author":"J. R. R. Tolkien",
   "genre":[
      "fantasy",
      "drama"
   ],
   "pageNumber":2,
   "content":"Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum....."
}
PUT /pages/_doc/isbn_488
{
   "title":"Lord of the rings",
   "publishYear":1954,
   "author":"J. R. R. Tolkien",
   "genre":[
      "fantasy",
      "drama"
   ],
   "pageNumber":488,
   "content":"Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum....."
}
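
If the data already exists in the book-with-pages shape, producing these page documents could look something like the sketch below. The helper name `book_to_page_docs` and the extra `isbn` field are assumptions of mine. The `isbn` value is copied into every page so that pages can later be grouped back into books, and the IDs follow the `isbn_<pageNumber>` pattern used above.

```python
def book_to_page_docs(book, isbn):
    """Yield (doc_id, page_doc) pairs, copying the book metadata into each page."""
    # Everything except the pages array is book-level metadata.
    metadata = {key: value for key, value in book.items() if key != "pages"}
    for page in book["pages"]:
        doc = dict(metadata)
        doc["isbn"] = isbn  # assumed grouping key, not part of the original example
        doc["pageNumber"] = page["pageNumber"]
        doc["content"] = page["content"]
        yield f"{isbn}_{page['pageNumber']}", doc
```

Each pair can then be sent as `PUT /pages/_doc/<doc_id>` or through the bulk API.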

As I said, if I index by page I face several problems:

  • Scoring: what if I want to know the score of a term in the whole book? If I index by page, can I make a query that gives me the total score of all pages of every book?

  • Aggregations: what if I want to know how many books of a certain genre I have? The query will aggregate pages, not books.

  • Filtering: maybe this one is related to the above. What if I had 2 kinds of books, type1 and type2? How can I filter by type if the documents I get back are pages and not books?

Scoring

Indeed. But in that case you want to score by book because you want to find a book, not a page. Maybe index both then? Books and pages?
You could use the parent/child feature otherwise. But I'm not sure how elegant this solution is.
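
A rough sketch of what the parent/child option could look like, using a `join` field and a `has_child` query. The field name `relation`, the helper names, and using the ISBN as the parent document ID are all assumptions. Child documents also have to be indexed with `routing` set to the parent ID so that they land on the same shard as their book.

```python
# Assumed mapping: books and pages live in one index, linked by a join field.
JOIN_MAPPING = {
    "mappings": {
        "properties": {
            "title": {"type": "text"},
            "genre": {"type": "keyword"},
            "pageNumber": {"type": "integer"},
            "content": {"type": "text"},
            # "book" is the parent relation, "page" is the child relation.
            "relation": {"type": "join", "relations": {"book": "page"}},
        }
    }
}


def page_child_doc(isbn, page_number, content):
    """Build a page document that points at its parent book (ID assumed to be the ISBN)."""
    return {
        "pageNumber": page_number,
        "content": content,
        "relation": {"name": "page", "parent": isbn},
    }


def books_with_matching_pages(term, pages_per_book=3):
    """Return books, scored by the sum of their matching pages' scores."""
    return {
        "query": {
            "has_child": {
                "type": "page",
                "query": {"match": {"content": term}},
                "score_mode": "sum",
                # The matching pages come back per book, best first.
                "inner_hits": {"size": pages_per_book},
            }
        }
    }
```

The hits of this query are book documents, so genre filters and aggregations apply to books directly. The cost is the join at query time.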

Aggregations

If you run a cardinality aggregation on the ISBN, then you will get the number of books, I think.
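
Something along these lines, assuming every page document carries `isbn` and `genre` as `keyword` fields (neither mapping is shown in this thread). The second helper is a separate idea that I believe also works: grouping pages by `isbn` and summing `_score` with a script gives a rough whole-book score on a page-per-document index. Note that `cardinality` counts are approximate.

```python
def books_per_genre():
    """Count distinct books (by isbn) per genre, even though documents are pages."""
    return {
        "size": 0,  # only the aggregation result is needed
        "aggs": {
            "genres": {
                "terms": {"field": "genre"},
                "aggs": {"book_count": {"cardinality": {"field": "isbn"}}},
            }
        },
    }


def book_scores(term, books=10, pages_per_book=3):
    """Rank books by the summed score of their matching pages."""
    return {
        "size": 0,
        "query": {"match": {"content": term}},
        "aggs": {
            "books": {
                "terms": {
                    "field": "isbn",
                    "size": books,
                    "order": {"total_score": "desc"},
                },
                "aggs": {
                    # Sum of page scores, used as the book-level score.
                    "total_score": {"sum": {"script": {"source": "_score"}}},
                    # Best matching pages of each book, highest score first.
                    "best_pages": {
                        "top_hits": {
                            "size": pages_per_book,
                            "_source": ["pageNumber"],
                        }
                    },
                },
            }
        },
    }
```

Summing scores favors books with many matching pages, so whether that is the right notion of a book score depends on the use case.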

Filtering

You can put the type of the book in each page. Just like I did for the genre.
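
For example, with a hypothetical `bookType` keyword field copied into every page, the filter could be a minimal `bool` query like this sketch (the field name and helper name are mine):

```python
def search_pages_of_type(term, book_type):
    """Match page content, restricted to pages whose book has the given type."""
    return {
        "query": {
            "bool": {
                # Scored part: the full-text match on the page content.
                "must": [{"match": {"content": term}}],
                # Non-scoring part: exact match on the assumed bookType field.
                "filter": [{"term": {"bookType": book_type}}],
            }
        }
    }
```

The filter clause does not affect scoring, so page scores stay comparable with and without it.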

Exactly. There is one use case where I need to find a book, just the book that matches some term in its content, and another use case where I need to know which pages matched the term and how well each page matched. I think this can be done with the book example I gave, but I'm concerned about how big the document will be, containing up to 500 nested objects...
