Match_phrase score calculation

We are currently working with Elasticsearch 6.6 and are having trouble with match_phrase and the score it calculates.

What's the setting
The similarity setting is currently this:
"scripted_wc": {
  "type": "scripted",
  "script": {
    "source": "return doc.freq;"
  }
}

What do we want to have
For example, we have three documents:
"bbb aaa aaa bbb aaa",
"aaa ccc aaa bbb",
"ccc aaa aaa bbb aaa"

When we query for "aaa", the scores are 3, 2 and 3, which is correct. But if we search for "ccc aaa", we get the scores 0, 2 and 2, whereas we expect 0, 1 and 1. With explain enabled, we see that each additional word in the phrase creates a subquery that returns 1, and these results are summed. So "ccc aaa aaa" returns 0, 0, 3 instead of 0, 0, 1.
"bbb aaa" should return 2, 0, 1 for the three documents (the document scoring 0 would of course not be returned).

Is this a bug, or is our understanding wrong?

Thanks for the help

Is this a bug, or is our understanding wrong?

That's the current design. For phrases, the script is executed once for each individual term, and doc.freq is the frequency of the phrase within the document. BM25 similarity is a bit special: it sums the idf of each term and then uses that to score the phrase once. The other similarities are consistent with the scripted similarity: they sum the scores of the individual terms to produce a score that is comparable with a boolean query on the same terms.
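
To make the behaviour described above concrete, here is a small Python sketch that models it outside Elasticsearch. It is only an approximation under stated assumptions: tokens are split on whitespace, the phrase is matched exactly (no slop), and the function names are made up for illustration. It reproduces the scores reported in the question.

```python
# Simplified model, not Elasticsearch code: whitespace tokens, exact phrase match.
DOCS = [
    "bbb aaa aaa bbb aaa",
    "aaa ccc aaa bbb",
    "ccc aaa aaa bbb aaa",
]


def phrase_freq(doc: str, phrase: str) -> int:
    """Count exact, in-order occurrences of `phrase` in `doc`."""
    tokens = doc.split()
    terms = phrase.split()
    n = len(terms)
    return sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == terms)


def summed_script_score(doc: str, phrase: str) -> int:
    """Run the script `return doc.freq;` once per term and sum the results.

    For a phrase, doc.freq is the phrase frequency, so every term contributes
    the same value and the total is (number of terms) * (phrase frequency).
    """
    freq = phrase_freq(doc, phrase)
    return sum(freq for _ in phrase.split())


def scores(phrase: str) -> list:
    return [summed_script_score(doc, phrase) for doc in DOCS]


print(scores("aaa"))          # [3, 2, 3]
print(scores("ccc aaa"))      # [0, 2, 2]
print(scores("ccc aaa aaa"))  # [0, 0, 3]
```

The scores the question asks for are simply the phrase frequencies, that is phrase_freq without the per-term summation.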

Well, is there a way to say that "ccc aaa", for example, is a single term and not two?

Not with the current API, no. If you know the number of terms in the phrase, you could use a custom boost to change the score of the phrase query. We could also change the API to make all terms available when scoring a phrase, but I am not sure what the API would look like. If you're interested, can you open an issue on GitHub that explains the current limitation?

Thanks for your suggestion. We have now done it as you suggested:

doc.freq * query.boost - where query.boost is 1/count(words in query).
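
As a rough sketch of this workaround, the Python helper below builds a match_phrase request body whose boost is 1 divided by the number of words in the phrase. The field name "content" and the helper name are hypothetical, and counting words with a whitespace split is an assumption that only holds if the field's analyzer produces the same number of tokens. The similarity script is assumed to have been changed to multiply by query.boost, as in the formula above.

```python
import json

# Assumed similarity script after the change described above.
SIMILARITY_SCRIPT = "return doc.freq * query.boost;"


def phrase_query(field: str, phrase: str) -> dict:
    """Build a match_phrase body with boost = 1 / (number of words in the phrase).

    The word count uses a whitespace split, which is an assumption: it must
    match the number of tokens the field's analyzer produces for the phrase.
    """
    n_terms = len(phrase.split())
    if n_terms == 0:
        raise ValueError("phrase must contain at least one word")
    return {
        "query": {
            "match_phrase": {
                field: {"query": phrase, "boost": 1.0 / n_terms}
            }
        }
    }


# "content" is a hypothetical field name used only for illustration.
body = phrase_query("content", "ccc aaa")
print(json.dumps(body, indent=2))
```

With the script run once per term and the results summed, each of the n terms contributes doc.freq / n, so the total comes back to roughly the phrase frequency (small floating-point rounding is possible).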

I will also create an issue on GitHub.
