Fulltext search with fuzziness on strings containing numbers


#1

I have a use case where I need to match address strings.

Since the addresses come in many different shapes and forms I decided to put all fields in one string and do a full text search.

The main match will be done on the entity name, but the address should also match, allowing for some variation.

I observe some odd behaviour when matching the numbers embedded in the addresses:

Assume I have the following address

"10th Floor, Trustee House, 55 Samora Machel Avenue, Harare, ZW"

with a query

{
  "match": {
    "address_list": {
      "query": "ADDRESS",
      "operator": "and",
      "fuzziness": "auto"
    }
  }
}

I can find misspelled addresses like

ADDRESS = "10th Floor, Truste House, 55 Samore Machel Avenue, Harare, ZW"

and I can still find them if they moved a floor up

ADDRESS = "11th Floor, Truste House, 55 Samora Machel Avenue, Harare, ZW"

but if they move next door, I no longer find them:

ADDRESS = "10th Floor, Trustee House, 56 Samore Machel Avenue, Harare, ZW"

if they move 100 houses up the street, I find them again:

ADDRESS = "10th Floor, Truste House, 155 Samore Machel Avenue, Harare, ZW"

Is there a way to make fuzziness also work on the numbers in the string in a more predictable manner?


(Mayya Sharipova) #2

Hi there,
This is expected behaviour. Fuzziness is calculated based on the Levenshtein edit distance: https://www.elastic.co/guide/en/elasticsearch/reference/current/common-options.html#fuzziness
And your numbers are not represented as numbers, but as text tokens.

Your option "auto" means that the allowed edit distance is based on the length of the term. With the default thresholds, terms of 1-2 characters must match exactly, terms of 3-5 characters allow one edit, and longer terms allow two. So for a short term like your 55 or 56, the allowed distance is 0, while 155 is long enough to allow one edit and still matches 55. Set fuzziness to 1 to allow an edit distance of 1, and the address with 56 will be found.

"fuzziness": 1
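
To make the length rule concrete, here is a small Python sketch that mimics it outside Elasticsearch. It is an illustration only, not how Elasticsearch is implemented: the helper names (levenshtein, auto_max_edits, fuzzy_matches) are made up for this example, the thresholds 3 and 6 are the documented defaults for "auto", and it uses plain Levenshtein distance without counting a transposition as a single edit.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance: insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute
        prev = curr
    return prev[-1]


def auto_max_edits(term: str, low: int = 3, high: int = 6) -> int:
    """Allowed edits for a query term under the assumed "auto" defaults (3, 6)."""
    if len(term) < low:
        return 0   # 1-2 characters: must match exactly
    if len(term) < high:
        return 1   # 3-5 characters: one edit
    return 2       # 6 or more characters: two edits


def fuzzy_matches(query_term: str, indexed_term: str) -> bool:
    """True if the indexed term is within the edits allowed for the query term."""
    return levenshtein(query_term, indexed_term) <= auto_max_edits(query_term)


# Query tokens from the examples above, compared with the indexed tokens.
for query_term, indexed_term in [("truste", "trustee"), ("11th", "10th"),
                                 ("56", "55"), ("155", "55")]:
    print(query_term, indexed_term, fuzzy_matches(query_term, indexed_term))
```

Each pair is only one edit apart, but "56" is the only query token that is too short to be allowed any edits, which is why only that query misses.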

Also, for better control, consider the other fuzziness-related options, such as prefix_length and transpositions (called fuzzy_transpositions in a match query).
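
As a rough sketch of how these options fit together, the Python snippet below builds a match query body with a fixed fuzziness plus prefix_length and fuzzy_transpositions. The function name build_address_query and the chosen values are only assumptions for illustration; the field name address_list comes from the query above. Sending the body to a cluster is left out on purpose.

```python
import json


def build_address_query(address: str, fuzziness=1, prefix_length: int = 0,
                        transpositions: bool = True) -> dict:
    """Build a match query body for the address_list field (illustrative helper)."""
    return {
        "query": {
            "match": {
                "address_list": {
                    "query": address,
                    "operator": "and",
                    # A fixed value applies to every term, short numbers included.
                    "fuzziness": fuzziness,
                    # Number of leading characters that must match exactly.
                    "prefix_length": prefix_length,
                    # Whether swapping two adjacent characters counts as one edit.
                    "fuzzy_transpositions": transpositions,
                }
            }
        }
    }


body = build_address_query(
    "10th Floor, Trustee House, 56 Samora Machel Avenue, Harare, ZW",
    fuzziness=1,
    prefix_length=1,
)
print(json.dumps(body, indent=2))
```

With prefix_length set to 1, a query for 56 should still be able to match 55, since the first character is the same, while 155 should no longer match 55. Note that a fixed fuzziness of 1 also lowers the tolerance for long words, which get two edits under "auto".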

