Custom score for fuzzy matching based on Levenshtein distance score

I have a requirement where there needs to be custom scoring on name field based on the Levenshtein distance and not the score returned by elastic search
e.g
Smith vs Smith = 100%,
Smith vs Smithh = 90%,
Smith vs SSmithhh = 60%

Has someone done this? How?
As an idea, I think native scripting would be required using some elasticsearch property that gives the distance score between two terms but I'm not sure how it can be done.

1 Like

The match query accepts a fuzziness parameter which allows you to do fuzzy matching based on the Damerau-Levenshtein edit distance (see the explanation for the fuzziness parameter in our docs here).

Now, two things to keep in mind with fuzziness: the maximum edit distance that Elasticsearch supports is 2, so Smith and SSmithhh will never be a match. And, the default scoring is not quite how you want it to be.

However, if a maximum edit distance of 2 is enough for your use case, then you can use a combination of bool and constant_score queries to get to the scoring you want. The idea is that we look for all documents that match with an edit distance of 2, and assign a score of 80 to those documents. Then, if those documents also match with an edit distance of 1, increase the score by 10 points. And if those documents also match with an edit distance of 0 - assign another 10 points to get to a perfect score of 100.

So, let's say these are your documents:

PUT levtest/_doc/_bulk
{ "index" : { "_id": 1 } }
{ "name": "Smith" }
{ "index" : { "_id": 2 } }
{ "name": "Smithh" }
{ "index" : { "_id": 3 } }
{ "name": "SSmithh" }

Then your query could look like this:

GET levtest/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "constant_score": {
            "filter": {
              "match": {
                "name": {
                  "query": "Smith",
                  "fuzziness": 2
                }
              }
            },
            "boost": 80
          }
        },
        {
          "constant_score": {
            "filter": {
              "match": {
                "name": {
                  "query": "Smith",
                  "fuzziness": 1
                }
              }
            },
            "boost": 10
          }
        },
        {
          "constant_score": {
            "filter": {
              "match": {
                "name": {
                  "query": "Smith"
                }
              }
            },
            "boost": 10
          }
        }
      ]
    }
  }
}

Returning:

    "hits": [
      {
        "_index": "levtest",
        "_type": "_doc",
        "_id": "1",
        "_score": 100,
        "_source": {
          "name": "Smith"
        }
      },
      {
        "_index": "levtest",
        "_type": "_doc",
        "_id": "2",
        "_score": 90,
        "_source": {
          "name": "Smithh"
        }
      },
      {
        "_index": "levtest",
        "_type": "_doc",
        "_id": "3",
        "_score": 80,
        "_source": {
          "name": "SSmithh"
        }
      }
    ]

A warning though: queries with a "fuzziness": 2 are computationally expensive.

2 Likes

Thank you for your reply. It does solve my problem partially. To elaborate a bit on my question, when I do a edit distance similarity between two strings in SQL(oracle) I get a certain score e.g.
select utl_match.edit_distance_similarity('Smith','Smith') from dual;
Score: 100

select utl_match.edit_distance_similarity('Smith','Smithh') from dual;
Score: 84

select utl_match.edit_distance_similarity('Smith','SSmithhh') from dual;
Score: 63

Is this something doable in elasticsearch?

There is no out of the box functionality that gives you a similarity score like that, as far as I know.

You'd have to implement this yourself as a script that you could use as a script_score function in the function score query (which is likely going to be very slow). A better way could be to implement this as a Java plugin.

Either way, you'll be looking at writing some code, unless someone else on this forum has a better idea.

2 Likes

Java Plugin does seem to be the best approach to go at it. However I was not able to find enough documentation anywhere to get me started, especially when going with gradle. I'd really appreciate if you could point me to something that will help me build the skeleton of the plugin or something already there that can be modified (what I already looked at seems to be a bit confusing and seems incomplete in terms of information). Rest would be just implementation that would return the score as per the requirement.

There are some blogs out there on creating Elasticsearch plugins that may be helpful. If you're running into specific problems, it's probably better to create a new post on this forum to give your question some visibility.

1 Like

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.