Indexed documents
{
"book_id":"book01",
"pages":[
{ "page_id":1, "words":["1", "2", "xx"] },
{ "page_id":2, "words":["4", "5", "xx"] },
{ "page_id":3, "words":["7", "8", "xx"] }
]
}
{
"book_id":"book02",
"pages":[
{ "page_id":1, "words":["1", "xx", "xx"] },
{ "page_id":2, "words":["4", "xx", "xx"] },
{ "page_id":3, "words":["7", "xx", "xx"] }
]
}
Input data
{
"book_id":"book_new",
"pages":[
{ "page_id":1, "words":["1", "2", "3"] },
{ "page_id":2, "words":["4", "5", "6"] },
{ "page_id":3, "words":["xx", "xx", "xx"] }
]
}
I have a number of books that have multiple pages. Each page has a list of words.
I would like to search for books that have at least a threshold number of pages similar to the pages of an input book.
Thresholds
- min_word_match_score : 2 (minimum number of matching words between two pages)
- min_page_match_score : 2 (minimum number of similar pages between two books)
Key terms
- similar pages: Two pages that have at least min_word_match_score words in common
- similar books: Two books that have at least min_page_match_score similar pages
Expected result
Based on the specified thresholds, only book01 should be returned, because:
- book01-1 and book_new-1 have score 2 (>=min_word_match_score, totalScore++)
- book01-2 and book_new-2 have score 2 (>=min_word_match_score, totalScore++)
- book01 and book_new have a total score of 2 (totalScore >= min_page_match_score)
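To make the expected result concrete, here is a minimal Python sketch of the matching logic I have in mind. It is not an Elasticsearch query, only a reference for the two-level scoring. It assumes that pages are compared only when they share the same page_id (as in the example above) and that the word-match score is the number of distinct words two pages have in common. The function names similar_page_count and find_similar_books are hypothetical.

```python
def similar_page_count(book_a, book_b, min_word_match_score):
    """Count page pairs (paired by page_id) sharing enough distinct words."""
    # Assumption: a page is only compared with the page of the same page_id.
    words_by_page = {p["page_id"]: set(p["words"]) for p in book_b["pages"]}
    count = 0
    for page in book_a["pages"]:
        other_words = words_by_page.get(page["page_id"], set())
        # Assumption: the word-match score is the number of distinct shared words.
        word_match_score = len(set(page["words"]) & other_words)
        if word_match_score >= min_word_match_score:
            count += 1
    return count


def find_similar_books(indexed_books, new_book,
                       min_word_match_score=2, min_page_match_score=2):
    """Return the book_id of every indexed book with enough similar pages."""
    return [
        book["book_id"]
        for book in indexed_books
        if similar_page_count(book, new_book, min_word_match_score)
        >= min_page_match_score
    ]


indexed_books = [
    {"book_id": "book01", "pages": [
        {"page_id": 1, "words": ["1", "2", "xx"]},
        {"page_id": 2, "words": ["4", "5", "xx"]},
        {"page_id": 3, "words": ["7", "8", "xx"]},
    ]},
    {"book_id": "book02", "pages": [
        {"page_id": 1, "words": ["1", "xx", "xx"]},
        {"page_id": 2, "words": ["4", "xx", "xx"]},
        {"page_id": 3, "words": ["7", "xx", "xx"]},
    ]},
]

book_new = {"book_id": "book_new", "pages": [
    {"page_id": 1, "words": ["1", "2", "3"]},
    {"page_id": 2, "words": ["4", "5", "6"]},
    {"page_id": 3, "words": ["xx", "xx", "xx"]},
]}

print(find_similar_books(indexed_books, book_new))  # ['book01']
```

The open question is how to express the same two thresholds inside a single search query.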
Poor search query (not working)
{
"bool" : {
"should" : [
{ "match" : { "book_pages.visual_words" : {"query" : "1", "operator" : "OR"} } },
{ "match" : { "book_pages.visual_words" : {"query" : "2", "operator" : "OR"} } },
{ "match" : { "book_pages.visual_words" : {"query" : "3", "operator" : "OR"} } }
],
"minimum_should_match" : 2,
"adjust_pure_negative" : true,
"boost" : 1.0
}
}
I first tried to write the part of the query that handles page matching, but it does not search array by array; it just searches against the words of all pages together. I am also not sure how to manage the two different scores: the word-match score and the page-match score.
Should I dig into inner_hits? Please help!