Match_phrase score calculation

kkendzia · March 7, 2019, 7:27am

Hi,
We are currently working with Elasticsearch 6.6 and are having trouble with match_phrase and the score it calculates.

What's the setting
Our similarity setting is currently:
"scripted_wc": {
  "type": "scripted",
  "script": {
    "source": "return doc.freq;"
  }
}

What we want to have
For example, we have three documents:
"bbb aaa aaa bbb aaa",
"aaa ccc aaa bbb",
"ccc aaa aaa bbb aaa"

When we query for "aaa", the scores are 3, 2 and 3, which is correct. But when we search for "ccc aaa" we get the scores 0, 2 and 2, whereas we expect 0, 1 and 1. With explain enabled, we see that each word in the phrase produces its own subquery that returns 1, and those results are summed. So "ccc aaa aaa" returns 0, 0, 3 instead of 0, 0, 1.
Likewise, "bbb aaa" should return 2, 0, 1 for the three documents (the 0 of course wouldn't be returned).

Is this a bug or is our understanding that wrong?
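To make the expectation concrete, here is a minimal Python sketch that counts phrase occurrences in the three example documents. It is not Elasticsearch code: it assumes plain whitespace tokenization, and the helper names `phrase_freq` and `expected_scores` are made up for illustration.

```python
# Illustrative model only: whitespace tokenization, no analyzer, no Lucene.
DOCS = [
    "bbb aaa aaa bbb aaa",
    "aaa ccc aaa bbb",
    "ccc aaa aaa bbb aaa",
]

def phrase_freq(doc: str, phrase: str) -> int:
    """Count how often the phrase occurs as consecutive tokens in the document."""
    tokens = doc.split()
    words = phrase.split()
    n = len(words)
    return sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == words)

def expected_scores(phrase: str) -> list:
    """The scores we would like to get: one point per phrase occurrence."""
    return [phrase_freq(doc, phrase) for doc in DOCS]

for phrase in ("aaa", "ccc aaa", "ccc aaa aaa", "bbb aaa"):
    print(phrase, expected_scores(phrase))
```

Under these assumptions the sketch gives 3, 2, 3 for "aaa" and 0, 1, 1 for "ccc aaa", which are the numbers we are hoping for.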

Thanks for the help

jimczi · March 7, 2019, 12:31pm

Is this a bug or is our understanding that wrong?

That's the current design: for phrases, the script is executed on each individual term, and doc.freq is the frequency of the phrase within the document. The BM25 similarity is a bit special: it sums the idf of each term and then uses the result to score the phrase once. The other similarities are consistent with the scripted similarity: they simply sum the score of each individual term, producing a score that is comparable with a boolean query on the same terms.
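The behaviour described here can be modelled roughly in Python as follows. This is a simplified sketch, not the actual Lucene implementation: it assumes whitespace tokenization, and `phrase_freq`, `script` and `observed_phrase_score` are hypothetical names standing in for the phrase frequency, the "return doc.freq;" script, and the per-term summation.

```python
# Simplified model of per-term scoring for a phrase; not Lucene code.
DOCS = [
    "bbb aaa aaa bbb aaa",
    "aaa ccc aaa bbb",
    "ccc aaa aaa bbb aaa",
]

def phrase_freq(doc: str, phrase: str) -> int:
    """Frequency of the whole phrase in the document (this plays the role of doc.freq)."""
    tokens = doc.split()
    words = phrase.split()
    n = len(words)
    return sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == words)

def script(doc_freq: int) -> float:
    """Stand-in for the similarity script "return doc.freq;"."""
    return float(doc_freq)

def observed_phrase_score(doc: str, phrase: str) -> float:
    """Run the script once per term of the phrase and sum the results."""
    freq = phrase_freq(doc, phrase)
    return sum(script(freq) for _term in phrase.split())

for phrase in ("ccc aaa", "ccc aaa aaa"):
    print(phrase, [observed_phrase_score(doc, phrase) for doc in DOCS])
```

In this model a phrase of n terms scores n times its phrase frequency, which matches the 0, 2, 2 and 0, 0, 3 reported in the question.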

kkendzia · March 7, 2019, 1:05pm

Well, is it possible to say that "ccc aaa", for example, is a single term and not two?

jimczi · March 7, 2019, 1:24pm

Not with the current API, no. If you know the number of terms in the phrase, you could use a custom boost to change the score of the phrase query. We could also change the API to make all terms available when scoring a phrase, but I am not sure what the API would look like. If you're interested, could you open an issue on GitHub that explains the current limitation?

kkendzia · March 11, 2019, 6:27am

Thanks for your suggestion. We have now done what you suggested:

doc.freq * query.boost, where query.boost is 1 / (number of words in the query).
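For anyone landing here later, this is roughly what the request bodies could look like, built as Python dicts. It is a sketch under assumptions: the field name "text" and the mapping type "_doc" are placeholders, and the term count is taken from a whitespace split, which only holds if the analyzer produces exactly one term per word. `normalized_score` is a hypothetical helper that just checks the arithmetic.

```python
import json

# Placeholder field name; replace with the real field.
FIELD = "text"

def index_body() -> dict:
    """Index creation body with the boost-aware scripted similarity."""
    return {
        "settings": {
            "similarity": {
                "scripted_wc": {
                    "type": "scripted",
                    "script": {"source": "return doc.freq * query.boost;"},
                }
            }
        },
        "mappings": {
            "_doc": {  # assumed mapping type name for a 6.x index
                "properties": {
                    FIELD: {"type": "text", "similarity": "scripted_wc"}
                }
            }
        },
    }

def phrase_query(phrase: str) -> dict:
    """match_phrase query whose boost is 1 / (number of words in the phrase)."""
    n_terms = len(phrase.split())  # assumes one analyzed term per word
    if n_terms == 0:
        raise ValueError("phrase must contain at least one word")
    return {"query": {"match_phrase": {FIELD: {"query": phrase, "boost": 1.0 / n_terms}}}}

def normalized_score(phrase_freq: int, n_terms: int) -> float:
    """Sum of n_terms per-term scores, each equal to phrase_freq * (1 / n_terms)."""
    return sum(phrase_freq * (1.0 / n_terms) for _ in range(n_terms))

print(json.dumps(phrase_query("ccc aaa"), indent=2))
```

Because the per-term scores are summed, n terms each scoring doc.freq / n add up to doc.freq again, up to floating-point rounding.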

I will create an issue on GitHub too.

system · April 8, 2019, 6:27am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
