Best approach to index a book content

Clony · April 8, 2020, 5:44pm

There are more requirements, but the most important ones are:

I need a search query that looks across ALL pages of a single book and gives me a score for the whole book (all pages).
I need to know which pages of the book matched the search term and which page matched best.
A book can contain a large number of pages, up to 500.
Not a requirement, but worth mentioning: I will never need Elasticsearch to return ALL pages, only the ones that matched.

My initial approach was to store a document with the following shape:

{
   "title":"Lord of the rings",
   "publishYear":1954,
   "author":"J. R. R. Tolkien",
   "genre":[
      "fantasy",
      "drama"
   ],
   "pages":[
      {
         "pageNumber":1,
         "content":"Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum....."
      },
      {
         "pageNumber":2,
         "content":"Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum......"
      },
      {
         "pageNumber":488,
         "content":"Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum....."
      }
   ]
}

This would be a single index, and the pages property would be a nested type, of course.
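For reference, a minimal sketch of what such a mapping could look like, written as a Python dict that mirrors the JSON body of an index-creation request. The field types (`keyword` for `genre`, `text` for `content`, and so on) are assumptions for illustration, not something stated in the post.

```python
import json


def build_book_mapping() -> dict:
    """Return an index-creation body in which `pages` is a nested field.

    Field types are illustrative assumptions; adjust them to the real data.
    """
    return {
        "mappings": {
            "properties": {
                "title": {"type": "text"},
                "publishYear": {"type": "integer"},
                "author": {"type": "text"},
                "genre": {"type": "keyword"},
                "pages": {
                    # `nested` keeps each page as its own hidden document,
                    # so a match is tied to one specific page.
                    "type": "nested",
                    "properties": {
                        "pageNumber": {"type": "integer"},
                        "content": {"type": "text"},
                    },
                },
            }
        }
    }


if __name__ == "__main__":
    # Print the JSON that would be sent as the body of the create-index request.
    print(json.dumps(build_book_mapping(), indent=2))
```

Elasticsearch documents a per-document cap on nested objects (`index.mapping.nested_objects.limit`, 10000 by default), so a 500-page book would fit under that limit.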

Would this be the best approach?

Can I find out which pages match the term I'm looking for and which page matched best? (I think inner hits does the job.)
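A rough sketch of how the inner hits idea could look, again as a Python dict mirroring the search request body. It assumes `pages` is mapped as `nested`; the choice of `score_mode: "sum"` (book score as the sum of its matching pages) and the helper name are assumptions for illustration.

```python
import json


def build_nested_page_query(term: str, max_pages: int = 5) -> dict:
    """Search books by page content and report the matching pages.

    Assumes `pages` is a nested field with a `content` sub-field.
    """
    return {
        # Leave the full page array out of the response; only the matching
        # pages come back, through inner_hits.
        "_source": {"excludes": ["pages"]},
        "query": {
            "nested": {
                "path": "pages",
                "query": {"match": {"pages.content": term}},
                # Combine the scores of all matching pages into the book
                # score; "avg", "max", "min" and "none" are the alternatives.
                "score_mode": "sum",
                # Inner hits are sorted by score by default, so the first
                # one is the best-matching page.
                "inner_hits": {"size": max_pages},
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_nested_page_query("ring"), indent=2))
```

Each book hit would then carry an `inner_hits` section listing up to `max_pages` matching pages with their individual scores.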

I'm worried about how Elasticsearch would behave when storing such large documents. I've read that nested documents are indexed separately, but that Elasticsearch then performs joins when you fetch the data, and these joins might perform badly.

Another approach would be to store single pages. That gets rid of the nested documents, but then how can I search a book across all its pages as a whole rather than individually (scoring all pages/documents of the same book together)? It would also make some filters awkward, such as filtering by genre, because the results would contain all pages of the same book, and aggregations would be a mess.

dadoonet · April 9, 2020, 3:43pm

Welcome!

If you are searching for pages, index pages. Not books.

Like:

PUT /pages/_doc/isbn_1
{
   "title":"Lord of the rings",
   "publishYear":1954,
   "author":"J. R. R. Tolkien",
   "genre":[
      "fantasy",
      "drama"
   ],
   "pageNumber":1,
   "content":"Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum....."
}
PUT /pages/_doc/isbn_2
{
   "title":"Lord of the rings",
   "publishYear":1954,
   "author":"J. R. R. Tolkien",
   "genre":[
      "fantasy",
      "drama"
   ],
   "pageNumber":2,
   "content":"Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum....."
}
PUT /pages/_doc/isbn_488
{
   "title":"Lord of the rings",
   "publishYear":1954,
   "author":"J. R. R. Tolkien",
   "genre":[
      "fantasy",
      "drama"
   ],
   "pageNumber":488,
   "content":"Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum....."
}

Clony · April 10, 2020, 2:36pm

As I said, if I index by page I face several problems:

Scoring: what if I want to know the score of a term across the whole book? If I index by page, can I write a query that gives me the total score of all pages of every book?
Aggregations: what if I want to know how many books of a certain genre I have? The query will aggregate pages, not books.
Filtering: this may be related to the point above. What if I have two kinds of book, type1 and type2? How can I filter by type if the documents I get back are pages and not books?

dadoonet · April 10, 2020, 3:56pm

Scoring

Indeed. But in that case you want to score by book because you want to find a book, not a page. Maybe index both, then? Books and pages?
Otherwise, you could use the parent/child feature. But I'm not sure how elegant that solution is.
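One further option with a pages-only index, not proposed in the thread, is to group page hits per book with a `terms` aggregation and sum the page scores. The sketch below builds such a request body as a Python dict. It assumes each page document carries a hypothetical `isbn` keyword field (the examples above only put the ISBN in the document ID), and summing `_score` through a script is only one possible way to define a book-level score.

```python
import json


def build_book_score_query(term: str, max_books: int = 10,
                           pages_per_book: int = 3) -> dict:
    """Group matching pages by book and rank books by summed page score.

    `isbn` is a hypothetical keyword field assumed to exist on every page.
    """
    return {
        "size": 0,  # only the aggregation result is needed
        "query": {"match": {"content": term}},
        "aggs": {
            "books": {
                "terms": {
                    "field": "isbn",
                    "size": max_books,
                    # Order the buckets by the summed score computed below.
                    "order": {"book_score": "desc"},
                },
                "aggs": {
                    # Sum of the relevance scores of the book's matching pages.
                    "book_score": {"sum": {"script": {"source": "_score"}}},
                    # The best-scoring pages of each book (sorted by score
                    # by default).
                    "best_pages": {
                        "top_hits": {
                            "size": pages_per_book,
                            "_source": ["pageNumber"],
                        }
                    },
                },
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_book_score_query("ring"), indent=2))
```

Ordering `terms` buckets by a sub-aggregation can be inexact across shards, so this is better read as a starting point than as a guaranteed ranking.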

Aggregations

If you run a cardinality aggregation on the isbn number, then you will have the number of books I think.
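A minimal sketch of that cardinality idea, as a Python dict mirroring the search request body. It assumes every page document has a hypothetical `isbn` keyword field and that `genre` is mapped as `keyword`; neither is confirmed by the examples in this thread.

```python
import json


def build_book_count_query(genre: str) -> dict:
    """Count distinct books of one genre in a pages-only index.

    `isbn` is a hypothetical keyword field assumed to exist on every page.
    """
    return {
        "size": 0,  # no page hits needed, only the count
        "query": {"bool": {"filter": [{"term": {"genre": genre}}]}},
        "aggs": {
            # Number of distinct ISBNs among the matching pages.
            "book_count": {"cardinality": {"field": "isbn"}}
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_book_count_query("fantasy"), indent=2))
```

The `cardinality` aggregation returns an approximate distinct count, which is usually close enough for "how many books" questions but is not guaranteed to be exact on large indices.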

Filtering

You can put the type of the book in each page. Just like I did for the genre.
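As a sketch of that suggestion, the request body below (a Python dict mirroring the JSON) combines a full-text match on the page content with a filter on the book type. The field name `bookType` is hypothetical and is assumed to be a `keyword` field copied onto every page.

```python
import json


def build_filtered_page_query(term: str, book_type: str) -> dict:
    """Match page content while restricting results to one book type.

    `bookType` is a hypothetical keyword field duplicated on each page.
    """
    return {
        "query": {
            "bool": {
                # Scored clause: how well the page content matches the term.
                "must": [{"match": {"content": term}}],
                # Non-scoring clause: keep only pages of the wanted book type.
                "filter": [{"term": {"bookType": book_type}}],
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_filtered_page_query("ring", "type1"), indent=2))
```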

Clony · April 10, 2020, 4:34pm

Exactly. There is one use case where I need to find a book, just the book whose content matches some term, and another use case where I need to know which pages matched the term and how well each page matched. I think this can be done with the book document I gave as an example, but I'm concerned about how big the document will be, containing up to 500 nested objects.

system · May 8, 2020, 4:47pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
