Query for array matches across multiple arrays

Indexed documents

{
  "book_id":"book01",
  "pages":[
    { "page_id":1, "words":["1", "2", "xx"] },
    { "page_id":2, "words":["4", "5", "xx"] },
    { "page_id":3, "words":["7", "8", "xx"] }
  ]
}
{
  "book_id":"book02",
  "pages":[
    { "page_id":1, "words":["1", "xx", "xx"] },
    { "page_id":2, "words":["4", "xx", "xx"] },
    { "page_id":3, "words":["7", "xx", "xx"] }
  ]
}

Input data

{
  "book_id":"book_new",
  "pages":[
    { "page_id":1, "words":["1", "2", "3"] },
    { "page_id":2, "words":["4", "5", "6"] },
    { "page_id":3, "words":["xx", "xx", "xx"] }
  ]
}

I have a number of books, each with multiple pages. Each page has a list of words.
I would like to search for indexed books that have at least a threshold number of pages similar to the pages of an input book.

Thresholds

  1. min_word_match_score : 2 (minimum number of matching words between two pages)
  2. min_page_match_score : 2 (minimum number of similar pages between two books)

Key terms

  1. similar pages: two pages that have at least min_word_match_score words in common
  2. similar books: two books that have at least min_page_match_score similar pages

Expected result

Based on the specified thresholds, the query should return only book01, because

  1. book01 page 1 and book_new page 1 have score 2 (>= min_word_match_score, totalScore++)
  2. book01 page 2 and book_new page 2 have score 2 (>= min_word_match_score, totalScore++)
  3. book01 and book_new have a total score of 2 (totalScore >= min_page_match_score)
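
To make the expected result concrete, here is a small Python sketch of the scoring I have in mind, run outside Elasticsearch. It rests on two assumptions that are not spelled out above: repeated words on a page count once (so ["xx", "xx", "xx"] contributes a single word), and an input page may be compared with any page of an indexed book, not only the one with the same page_id. The function names are made up for illustration.

```python
def count_similar_pages(candidate, query, min_word_match_score):
    """Count input pages that share enough distinct words with any candidate page."""
    count = 0
    for q_page in query["pages"]:
        q_words = set(q_page["words"])  # assumption: repeated words count once
        if any(
            len(q_words & set(c_page["words"])) >= min_word_match_score
            for c_page in candidate["pages"]
        ):
            count += 1
    return count


def similar_books(indexed, query, min_word_match_score=2, min_page_match_score=2):
    """Return the ids of indexed books with enough pages similar to the input book."""
    return [
        book["book_id"]
        for book in indexed
        if count_similar_pages(book, query, min_word_match_score) >= min_page_match_score
    ]


indexed_books = [
    {"book_id": "book01", "pages": [
        {"page_id": 1, "words": ["1", "2", "xx"]},
        {"page_id": 2, "words": ["4", "5", "xx"]},
        {"page_id": 3, "words": ["7", "8", "xx"]},
    ]},
    {"book_id": "book02", "pages": [
        {"page_id": 1, "words": ["1", "xx", "xx"]},
        {"page_id": 2, "words": ["4", "xx", "xx"]},
        {"page_id": 3, "words": ["7", "xx", "xx"]},
    ]},
]
book_new = {"book_id": "book_new", "pages": [
    {"page_id": 1, "words": ["1", "2", "3"]},
    {"page_id": 2, "words": ["4", "5", "6"]},
    {"page_id": 3, "words": ["xx", "xx", "xx"]},
]}

print(similar_books(indexed_books, book_new))  # ['book01']
```

If repeated words were counted individually instead, page 3 of book_new would share two "xx" with every page of book02, and book02 would match as well, so the distinct-word assumption matters for the expected result.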

Poor search query (not working)

{
  "bool" : {
    "should" : [
      { "match" : { "book_pages.visual_words" : {"query" : "1", "operator" : "OR"} } },
      { "match" : { "book_pages.visual_words" : {"query" : "2", "operator" : "OR"} } },
      { "match" : { "book_pages.visual_words" : {"query" : "3", "operator" : "OR"} } }
    ],
    "minimum_should_match" : 2,
    "adjust_pure_negative" : true,
    "boost" : 1.0
  }
}

I first tried to write the part of the query that matches a single page, but it does not search array by array; it just searches against the words of all pages combined. I am also not sure how to manage the two different scores: the word match score and the page match score.
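
One direction I am considering, as a rough and untested sketch: build one nested clause per input page, so that each clause has to be satisfied by a single indexed page, and let the outer bool count how many of those clauses match. The Python below only builds the request body as a dict. It assumes that "pages" is mapped as a nested field and "words" as keyword, and it uses the field names from the sample documents above rather than the ones in my query. The function name is hypothetical.

```python
import json


def build_similar_books_query(book, min_word_match_score=2, min_page_match_score=2):
    """Build a request body with one nested clause per usable input page."""
    page_clauses = []
    for page in book["pages"]:
        words = sorted(set(page["words"]))  # distinct words only
        if len(words) < min_word_match_score:
            # A page with too few distinct words can never reach the threshold.
            continue
        page_clauses.append({
            "nested": {
                "path": "pages",  # assumption: "pages" is mapped as nested
                "query": {
                    "bool": {
                        # assumption: "pages.words" is a keyword field
                        "should": [{"term": {"pages.words": w}} for w in words],
                        "minimum_should_match": min_word_match_score,
                    }
                },
                # inner_hits needs a unique name per nested clause
                "inner_hits": {"name": "input_page_%s" % page["page_id"]},
            }
        })
    return {
        "query": {
            "bool": {
                "should": page_clauses,
                "minimum_should_match": min_page_match_score,
            }
        }
    }


book_new = {"book_id": "book_new", "pages": [
    {"page_id": 1, "words": ["1", "2", "3"]},
    {"page_id": 2, "words": ["4", "5", "6"]},
    {"page_id": 3, "words": ["xx", "xx", "xx"]},
]}

print(json.dumps(build_similar_books_query(book_new), indent=2))
```

A caveat with this shape: the outer bool counts input pages that found a match, so two input pages matching the same indexed page would count twice, which may or may not be the intended meaning of min_page_match_score.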

Should I dig into inner hits (inner_hits)? Please help!
