Hi,
we are currently working with Elasticsearch 6.6 and are having trouble with match_phrase and the score it calculates.
What's the setting
The similarity setting is currently this:
"scripted_wc": {
  "type": "scripted",
  "script": {
    "source": "return doc.freq;"
  }
}
What we want to have
For example, we have three documents:
"bbb aaa aaa bbb aaa",
"aaa ccc aaa bbb",
"ccc aaa aaa bbb aaa"
When we query for "aaa", the scores are 3, 2 and 3, which is correct. But if we search for "ccc aaa" we get the scores 0, 2 and 2, whereas we expect 0, 1 and 1. With explain enabled, we can see that each word in the phrase creates a subquery that returns 1, and these results are summed. So "ccc aaa aaa" returns 0, 0, 3 instead of 0, 0, 1.
"bbb aaa" should return 2, 0, 1 for the three documents (the document scoring 0 of course wouldn't be returned).
That's the current design: for phrases, the script is executed on each individual term, and doc.freq is the frequency of the phrase within the document. The BM25 similarity is a bit special: it multiplies the idf of each term and then uses it to score the phrase once. The other similarities are consistent with the scripted similarity: they just sum the scores of the individual terms to produce a score that is comparable with a boolean query on the same terms.
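To make the arithmetic concrete, here is a small Python sketch that mimics the behavior described above for the three example documents. It is only a model of the scoring, not Elasticsearch code: it assumes plain whitespace tokenization, and the helper names phrase_freq and observed_score are made up for illustration.

```python
# Toy model of the described behavior: the script ("return doc.freq;") runs
# once per term in the phrase, and doc.freq is the phrase frequency each time.
DOCS = [
    "bbb aaa aaa bbb aaa",
    "aaa ccc aaa bbb",
    "ccc aaa aaa bbb aaa",
]


def phrase_freq(doc, phrase):
    """Count exact occurrences of the phrase in the doc (whitespace tokens)."""
    tokens = doc.split()
    terms = phrase.split()
    n = len(terms)
    return sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == terms)


def observed_score(doc, phrase):
    """Sum the per-term script results; each one equals the phrase frequency."""
    freq = phrase_freq(doc, phrase)
    return sum(freq for _term in phrase.split())


for phrase in ("aaa", "ccc aaa", "ccc aaa aaa", "bbb aaa"):
    observed = [observed_score(d, phrase) for d in DOCS]
    wanted = [phrase_freq(d, phrase) for d in DOCS]
    print(phrase, "observed:", observed, "wanted:", wanted)
```

In this model the observed score is always the phrase frequency multiplied by the number of terms in the phrase, which matches the numbers reported in the question.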
Not with the current API, no. If you know the number of terms in the phrase, you could use a custom boost to change the score of the phrase query. We could also change the API to make all terms available when scoring a phrase, but I am not sure what that API would look like. If you're interested, can you open an issue on GitHub that explains the current limitation?
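A rough sketch of the boost workaround, in Python, that builds a match_phrase request body with a boost of 1 divided by the number of terms. Several things here are assumptions: the field name "content" is hypothetical, the term count is taken from a whitespace split (the real analyzer may produce a different number of terms), and the helper name is invented. Also, as far as I know the scripted similarity exposes the boost to the script as query.boost rather than applying it automatically, so the script would probably need to become "return query.boost * doc.freq;" for this to have any effect.

```python
import json


def phrase_query_with_boost(field, phrase):
    """Build a match_phrase body whose boost cancels the per-term summing."""
    # Assumption: one term per whitespace-separated word after analysis.
    n_terms = len(phrase.split())
    return {
        "query": {
            "match_phrase": {
                field: {
                    "query": phrase,
                    "boost": 1.0 / n_terms,
                }
            }
        }
    }


# "content" is a placeholder field name, not taken from the original mapping.
body = phrase_query_with_boost("content", "ccc aaa")
print(json.dumps(body, indent=2))
```

With the two-term phrase "ccc aaa" the boost is 0.5, so a summed score of 2 would come back as 1, provided the script actually multiplies by the boost.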