I am using More Like This (MLT) queries to find similar documents. I would
like to set a threshold on the score to decide which documents should be
considered similar to the given document passed in the like text.
In the response hits I observe scores ranging from 0 to 2.5. The value 2.5
is only the highest score in the few test cases I have tried during
development; in production it may go even higher. I would therefore like
to know whether there is a way to normalize the scores so that they lie
between 0 and 1. The naive strategy of dividing each hit's score by the max
score on the client side is useless, because it always produces a score of
1.0 for the first hit (the one with the highest score) in the ranked hits,
so that hit always passes the threshold (say 0.3), however weak the match
actually is.
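To make the problem with max-score division concrete, here is a minimal Python sketch. The score lists are made-up values (not real MLT output), and normalize_by_max is a hypothetical helper name used only for illustration.

```python
def normalize_by_max(scores):
    """Divide each score by the largest score in the result list."""
    top = max(scores)
    if top == 0:
        return [0.0 for _ in scores]
    return [s / top for s in scores]


THRESHOLD = 0.3

# Hypothetical hit scores for two different queries (illustrative values only).
strong_match = [2.5, 1.25, 0.5]       # top hit is a genuinely close document
weak_match = [0.25, 0.125, 0.0625]    # every hit is only loosely related

for scores in (strong_match, weak_match):
    normalized = normalize_by_max(scores)
    passed = [n for n in normalized if n >= THRESHOLD]
    # The first normalized score is 1.0 in both cases, so the top hit
    # passes the threshold regardless of its raw score.
    print(normalized, len(passed))
```

In both cases the top hit ends up at 1.0, so the threshold cannot tell a strong best match from a weak one. Any useful normalization would need a reference value that does not depend on the result list itself.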
It would also be useful if I could somehow predict the highest possible
score for my MLT query, based on the internal formula that MLT uses for
scoring.
Can somebody please help me with these approaches?
Thanks!