Custom similarity without TF/IDF scoring

Hi all!

I don't need TF/IDF relevance scoring for my full-text search task.
I am trying to get a score in the range 0-100 based only on the doc.length variable.

My formula:

"similarity": {
  "custom_similarity": {
    "type": "scripted",
    "script": {
      "source": "double norm = 100/doc.length; return norm * query.boost;"
    }
  }
}

I get the expected results when the number of tokens in the query is <= the number of tokens in the document.

Suppose my query is "big bang theory".
Results:

"big bang theory",             score: 100%
"big bang theory stub1",       score: 75%
"big bang theory stub1 stub2", score: 60%

But for the same query, with documents shorter than the query:

"big",       score: 100%
"big bang",  score: 100%

the scoring doesn't work as I expect.
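
My guess at where these numbers come from, as a minimal sketch: it assumes the similarity script is evaluated once per matching query term, that the per-term scores are summed, and that query.boost is 1. Under those assumptions each match contributes 100/doc.length, so a document whose tokens all match always sums to 100. The function name `summed_score` is my own, not an Elasticsearch API.

```python
def summed_score(doc_tokens, query_tokens, boost=1.0):
    """Model the script: 100/doc.length * boost per matching query term, summed."""
    doc_length = len(doc_tokens)
    per_term = 100.0 / doc_length * boost  # what the script returns for one term
    # Assumption: one script evaluation per query term found in the document.
    matches = sum(1 for term in query_tokens if term in doc_tokens)
    return per_term * matches


query = "big bang theory".split()
documents = [
    "big bang theory",
    "big bang theory stub1",
    "big bang theory stub1 stub2",
    "big",
    "big bang",
]
for doc in documents:
    print(f"{doc!r}: {summed_score(doc.split(), query):.0f}")
```

This reproduces the five scores listed above (100, 75, 60, 100, 100), which suggests the script itself has no way to see how many query tokens were left unmatched.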

Summary:

---------------------------------------------------
|        Token count         |       Score        |
|----------------------------|--------------------|
| Query | Document | Matches | Expected | Current |
---------------------------------------------------
|   3   |    3     |    3    |  100 %   |  100 %  |
|   3   |    6     |    3    |   50 %   |   50 %  |
|   3   |    9     |    3    |   33 %   |   33 %  |
|   3   |    1     |    1    |   33 %   |  100 %  | <-- current formula does not give the expected result
|   3   |    2     |    2    |   66 %   |  100 %  | <-- same problem
|   3   |    2     |    1    |   33 %   |  100 %  | <-- same problem
---------------------------------------------------
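
To make the Expected column concrete: it appears to follow 100 * matches / max(query tokens, document tokens), rounded down. The sketch below only checks that reading against the rows of the table; the helper name `expected_score` is hypothetical and this is plain arithmetic, not an Elasticsearch script.

```python
def expected_score(query_tokens, doc_tokens, matches):
    """Expected percentage: matches relative to the longer of query and document."""
    return 100 * matches // max(query_tokens, doc_tokens)


# (query tokens, document tokens, matches) for each row of the table
rows = [(3, 3, 3), (3, 6, 3), (3, 9, 3), (3, 1, 1), (3, 2, 2), (3, 2, 1)]
for query_tokens, doc_tokens, matches in rows:
    score = expected_score(query_tokens, doc_tokens, matches)
    print(f"query={query_tokens} doc={doc_tokens} matches={matches}: {score} %")
```

So what I am missing in the script is the number of tokens in the query, to use in place of doc.length when the document is shorter.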

I would be glad of any advice on how to update my formula, or on a possible workaround.
Thanks 🙂
