Get text similarity by Levenshtein distance, normalized to the range 0-1

Hi all!

I search on a text field and want a similarity score for the entire phrase, not for individual tokens.
The algorithm should be Levenshtein distance.
The result should be normalized to the range 0 - 1.

Example:
text in query:

"big hat"

expected relevance score in the response:

* "big hat":            1.0
* "not big hat":        0.7
* "big black hat":      0.6
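
To make the target concrete, here is a minimal Python sketch of the kind of score I mean. It assumes character-level Levenshtein distance and normalization by the length of the longer string; both are assumptions, and the function names are only illustrative. With this particular normalization the two longer phrases come out at roughly 0.64 and 0.54, so the 0.7 and 0.6 above are approximate figures.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insert, delete, substitute; cost 1 each)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalize the distance to 0-1, where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(a, b) / longest


if __name__ == "__main__":
    query = "big hat"
    for text in ("big hat", "not big hat", "big black hat"):
        print(f"{text!r}: {similarity(query, text):.2f}")
```

Other normalizations (for example dividing by the sum of both lengths) would give different numbers, so the exact values depend on which one is chosen.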

I've already figured out some of the limitations of ES:

  • max fuzziness value for match query = 2
  • if no ES document contains exactly the same text as the query, there is no reference result (score 1.0) to normalize the other scores against
  • TF/IDF similarity works on tokens, not on the entire phrase, and takes into account how often each token occurs across the index
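
For reference, the fuzziness limitation shows up in a query like the one below. This is only a sketch of the request body built as a Python dict; the field name `title` and the helper `fuzzy_match_body` are hypothetical, and no client call is made.

```python
import json


def fuzzy_match_body(field: str, text: str, fuzziness="AUTO") -> dict:
    """Build a match-query body with fuzziness (field name is hypothetical)."""
    return {
        "query": {
            "match": {
                field: {
                    "query": text,
                    # Edit distance allowed per term; 2 is the maximum.
                    "fuzziness": fuzziness,
                    "operator": "and",
                }
            }
        }
    }


body = fuzzy_match_body("title", "big hat", fuzziness=2)
print(json.dumps(body, indent=2))
```

Because fuzziness is applied to each term separately, this cannot express a whole-phrase distance such as the 4 edits between "big hat" and "not big hat", and the `_score` that comes back is a BM25 score, not a value in the range 0 - 1.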

Maybe there are other approaches I could try?

I'd be glad of any comments,
thanks

Maybe another similarity module would help?
I currently use BM25 (the default).
