Fuzzy Aggregations

Elasticsearch supports fuzzy search queries and term aggregations.
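For reference, here is a rough sketch of the two request bodies I mean, written as Python dicts. The field name `name.keyword` is an assumption for illustration (a keyword sub-field on a hypothetical `name` field), not something from a real mapping.

```python
import json

# Assumed field: "name.keyword" is a hypothetical keyword sub-field.
# A fuzzy query looks for terms within a small edit distance of the value.
fuzzy_search = {
    "query": {
        "fuzzy": {
            "name.keyword": {"value": "Brian Smith", "fuzziness": 1}
        }
    }
}

# A terms aggregation buckets documents by the exact value of the field,
# so "Brian Smith" and "Brain Smith" end up in separate buckets.
terms_aggregation = {
    "size": 0,
    "aggs": {
        "names": {"terms": {"field": "name.keyword", "size": 10}}
    }
}

print(json.dumps(fuzzy_search, indent=2))
print(json.dumps(terms_aggregation, indent=2))
```

The first body only affects which documents match; the second only groups on exact terms. Nothing in either one merges near-identical terms into a single bucket.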

I am not aware of any way to do fuzzy aggregations, but it would be a great feature. My database deals with names, and misspellings are rampant. It would be great if "Brian Smith" were put into the same bucket as "Brain Smith".

If this is not possible today, is it anywhere on the roadmap? Is it even technically feasible?

I've built data linking systems before, and a general issue with using any fuzzy matching on large sets (such as aggregations) is that the iterative nature of linking the data means that little errors get amplified, like feedback noise between a PA and a mic. 'Mark' might match 'Marc', which might then join with 'Marcy' and then 'Macy'. Entities just snowball.
Each name needs additional context to prevent this: a postcode, a date of birth, a vehicle reg plate. Names on their own are not enough, and entities need to hold a rich mix of identifiers for error-free linking. This requires a much more complex system.
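To make the snowballing concrete, here is a minimal sketch, assuming links are made whenever two names are within one Levenshtein edit and that linked names are merged transitively. The function names and the union-find approach are my own illustration, not how any particular linking product works.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cluster_transitively(names, max_distance=1):
    """Merge names into entities, following links of links (union-find)."""
    parent = {n: n for n in names}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if edit_distance(a, b) <= max_distance:
                parent[find(a)] = find(b)

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    return list(clusters.values())


names = ["Mark", "Marc", "Marcy", "Macy"]
clusters = cluster_transitively(names)
print(clusters)
print(edit_distance("Mark", "Macy"))
```

All four names land in one entity, even though 'Mark' and 'Macy' are two edits apart and would never have been linked directly.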

Why couldn't it be set up such that, for every entity returned in a bucket, all entities within an edit distance of 1 are also returned? That way the bucket "Mark" will catch "Marc", but not "Marcy". However, the bucket "Marc" would also catch "Mark" and "Marcy".
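Roughly what I have in mind, as a sketch in plain Python rather than anything Elasticsearch offers today. `within_one_edit` and `expanded_bucket` are hypothetical helper names; the key point is that each bucket only looks one edit out from its own key and does not follow links any further.

```python
def within_one_edit(a: str, b: str) -> bool:
    """True if a and b differ by at most one insert, delete or substitute."""
    if a == b:
        return True
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equal-length) string
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # one substitution at position i
    return a[i:] == b[i + 1:]  # one extra character in b at position i


def expanded_bucket(key, names):
    """Names within one edit of the bucket key; no transitive expansion."""
    return [n for n in names if within_one_edit(key, n)]


names = ["Mark", "Marc", "Marcy", "Macy"]
for key in names:
    print(key, "->", expanded_bucket(key, names))
```

With these four names, the "Mark" bucket holds "Mark" and "Marc", while the "Marc" bucket holds "Mark", "Marc" and "Marcy". The buckets overlap, so a name can be counted in more than one of them.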

This sounds like a difficult system to build and run at scale, but perhaps possible.

"Jon" would match "Joan" but not "Jonathon"

Name matching is a specialist topic and is frequently backed by synonym databases (see http://www.basistech.com/text-analytics/rosette/name-indexer/).

My experience is that effective entity resolution relies on using added context, e.g. ZIP codes or phone numbers, alongside names.
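As a small illustration of that point, here is a sketch that only treats two records as the same entity when the names are close and the ZIP codes agree exactly. The `Record` type, the sample data and the threshold of 2 edits are all made up for the example; a real system would weigh several identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    name: str
    zip_code: str


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance; a swapped pair of letters costs 2."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def same_entity(a: Record, b: Record, max_name_edits: int = 2) -> bool:
    """Fuzzy on the name, but strict on the extra identifier."""
    return (
        a.zip_code == b.zip_code
        and edit_distance(a.name, b.name) <= max_name_edits
    )


# Hypothetical sample records.
r1 = Record("Brian Smith", "90210")
r2 = Record("Brain Smith", "90210")  # typo, same ZIP
r3 = Record("Brian Smyth", "10001")  # closer spelling, different ZIP

print(same_entity(r1, r2))
print(same_entity(r1, r3))
```

Here the typo "Brain Smith" is linked because the ZIP code agrees, while "Brian Smyth" is kept separate despite being only one edit away, because the ZIP code differs.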
