Custom score for fuzzy matching based on Levenshtein distance

The match query accepts a fuzziness parameter which allows you to do fuzzy matching based on the Damerau-Levenshtein edit distance (see the explanation for the fuzziness parameter in our docs here).

Now, two things to keep in mind with fuzziness. First, the maximum edit distance that Elasticsearch supports is 2, so Smith and SSmithhh (edit distance 3) will never match. Second, the default scoring is probably not what you want.

However, if a maximum edit distance of 2 is enough for your use case, you can combine bool and constant_score queries to get the scoring you want. The idea is to find all documents that match within an edit distance of 2 and assign them a score of 80. Documents that also match within an edit distance of 1 get another 10 points, and documents that also match exactly (edit distance 0) get 10 more, for a perfect score of 100.
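To make the arithmetic concrete, here is a minimal Python sketch of the same tiers outside Elasticsearch. The function names `osa_distance` and `tiered_score` are made up for illustration, and the distance routine is a plain textbook "optimal string alignment" variant of Damerau-Levenshtein, not the code Lucene actually runs. The names are lowercased first on the assumption that the field is analyzed with a lowercasing analyzer such as the default one.

```python
from typing import Optional


def osa_distance(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, substitutions
    and transpositions of two adjacent characters."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # transposition of two adjacent characters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


def tiered_score(distance: int) -> Optional[int]:
    """Score from the tiers described above; None means no clause matches."""
    if distance > 2:
        return None      # beyond the maximum supported edit distance
    score = 80           # matches within edit distance 2
    if distance <= 1:
        score += 10      # also matches within edit distance 1
    if distance == 0:
        score += 10      # also matches exactly
    return score


if __name__ == "__main__":
    for name in ["Smith", "Smithh", "SSmithh", "SSmithhh"]:
        dist = osa_distance("smith", name.lower())
        print(name, dist, tiered_score(dist))
```

The last name, SSmithhh, is three edits away from Smith, so no tier applies and it would not be returned at all.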

So, let's say these are your documents:

PUT levtest/_doc/_bulk
{ "index" : { "_id": 1 } }
{ "name": "Smith" }
{ "index" : { "_id": 2 } }
{ "name": "Smithh" }
{ "index" : { "_id": 3 } }
{ "name": "SSmithh" }

Then your query could look like this:

GET levtest/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "constant_score": {
            "filter": {
              "match": {
                "name": {
                  "query": "Smith",
                  "fuzziness": 2
                }
              }
            },
            "boost": 80
          }
        },
        {
          "constant_score": {
            "filter": {
              "match": {
                "name": {
                  "query": "Smith",
                  "fuzziness": 1
                }
              }
            },
            "boost": 10
          }
        },
        {
          "constant_score": {
            "filter": {
              "match": {
                "name": {
                  "query": "Smith"
                }
              }
            },
            "boost": 10
          }
        }
      ]
    }
  }
}

Returning:

    "hits": [
      {
        "_index": "levtest",
        "_type": "_doc",
        "_id": "1",
        "_score": 100,
        "_source": {
          "name": "Smith"
        }
      },
      {
        "_index": "levtest",
        "_type": "_doc",
        "_id": "2",
        "_score": 90,
        "_source": {
          "name": "Smithh"
        }
      },
      {
        "_index": "levtest",
        "_type": "_doc",
        "_id": "3",
        "_score": 80,
        "_source": {
          "name": "SSmithh"
        }
      }
    ]
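If you build this request from application code, the three clauses can be generated instead of written out by hand. The following is only a sketch: `tiered_fuzzy_query` and its `tiers` parameter are hypothetical names, and the function just produces the request body as a plain Python dict, which you would then pass to whatever client you use.

```python
import json


def tiered_fuzzy_query(field, text, tiers=((2, 80), (1, 10), (0, 10))):
    """Build the bool/constant_score body shown above.

    tiers is a sequence of (fuzziness, boost) pairs; a fuzziness of 0
    is emitted as a plain match clause without a fuzziness parameter.
    """
    should = []
    for fuzziness, boost in tiers:
        match_params = {"query": text}
        if fuzziness > 0:
            match_params["fuzziness"] = fuzziness
        should.append(
            {
                "constant_score": {
                    "filter": {"match": {field: match_params}},
                    "boost": boost,
                }
            }
        )
    return {"query": {"bool": {"should": should}}}


if __name__ == "__main__":
    # Prints the same request body as the query above.
    print(json.dumps(tiered_fuzzy_query("name", "Smith"), indent=2))
```

Keeping the tiers in one place makes it easy to change the 80/10/10 split, as long as the boosts still add up to the perfect score you want.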

A warning though: queries with "fuzziness": 2 are computationally expensive.
