Hi all!
I don't need TF/IDF relevance scoring for my full-text search task.
I am trying to get a score in the range 0-100 based only on the doc.length variable.
My formula:
"similarity": {
  "custom_similarity": {
    "type": "scripted",
    "script": {
      "source": "double norm = 100/doc.length; return norm * query.boost;"
    }
  }
}
I get the expected results when the number of tokens in the query is <= the number of tokens in the document.
Suppose my query is "big bang theory".
Results:
"big bang theory", score: 100%
"big bang theory stub1", score: 75%
"big bang theory stub1 stub2", score: 60%
But with the same query, the scoring does not work as I expect for shorter documents:
"big", score: 100%
"big bang", score: 100%
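My guess at why these numbers come out this way, as a small Python model rather than actual Elasticsearch code: I am assuming the script is evaluated once per matching query term and the per-term scores are then summed. The function name and its parameters are made up for illustration.

```python
def current_score(doc_tokens: int, matched_terms: int, boost: float = 1.0) -> float:
    """Rough model of the scripted similarity above.

    Assumption: each matched term contributes 100 / doc.length * query.boost,
    and the contributions of all matched terms are added together.
    """
    return matched_terms * 100 * boost / doc_tokens


# (document, number of tokens in it, query terms it matches)
cases = [
    ("big bang theory", 3, 3),
    ("big bang theory stub1", 4, 3),
    ("big bang theory stub1 stub2", 5, 3),
    ("big", 1, 1),
    ("big bang", 2, 2),
]
for text, tokens, matched in cases:
    print(f"{text!r}: {current_score(tokens, matched):.0f}%")
```

If that assumption holds, a document that is shorter than the query but fully covered by it always reaches 100, because the number of matched terms equals doc.length.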
Summary:
-------------------------------------------------------
|        Tokens count        |   Score    |   Score   |
|----------------------------|  expected  |  current  |
| Query | Document | Matches |            |           |
-------------------------------------------------------
|   3   |    3     |    3    |   100 %    |   100 %   |
|   3   |    6     |    3    |    50 %    |    50 %   |
|   3   |    9     |    3    |    33 %    |    33 %   |
|   3   |    1     |    1    |    33 %    |   100 %   | <-- current algorithm does not provide expected result
|   3   |    2     |    2    |    66 %    |   100 %   | <-- the same point
|   3   |    2     |    1    |    33 %    |   100 %   | <-- the same point
-------------------------------------------------------
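To make the "expected" column concrete: as far as I can tell, every row is consistent with dividing the number of matches by the larger of the two token counts and rounding down. This is only a Python sketch of the target formula (the function name is hypothetical), not something I know how to express in a scripted similarity.

```python
def expected_score(query_tokens: int, doc_tokens: int, matches: int) -> int:
    """Target score in percent, rounded down.

    Assumption: the desired score is matches / max(query tokens, document tokens).
    """
    return 100 * matches // max(query_tokens, doc_tokens)


# (query tokens, document tokens, matches), taken from the table rows
rows = [(3, 3, 3), (3, 6, 3), (3, 9, 3), (3, 1, 1), (3, 2, 2), (3, 2, 1)]
for query_tokens, doc_tokens, matches in rows:
    score = expected_score(query_tokens, doc_tokens, matches)
    print(f"query={query_tokens} doc={doc_tokens} matches={matches} -> {score} %")
```

The part I am missing is the query token count: my script only sees doc.length, so it has no way to take the max with the query length.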
I would be glad of any advice on how to update my formula, or on a possible workaround.
Thanks