Fuzzy Aggregations

Michael_Sander · April 28, 2017, 10:40pm

Elasticsearch supports fuzzy search queries and term aggregations.

I am not aware of any way to do fuzzy aggregations, but it would be a great feature. My database deals with names, and misspellings are rampant. It would be great if "Brian Smith" were put into the same bucket as "Brain Smith".

If this is not possible today, is it anywhere on the roadmap? Is it even technically feasible?

Mark_Harwood · April 29, 2017, 8:11am

I've built data-linking systems before, and a general problem with fuzzy matching on large sets (such as aggregations) is that linking is iterative, so small errors get amplified like feedback between a PA and a mic. 'Mark' might match 'Marc', which might then join with 'Marcy' and then 'Macy'. Entities just snowball.
Each name needs additional context to prevent this: a postcode, a date of birth, a vehicle registration plate. Names on their own are not enough, and entities need to hold a rich mix of identifiers for error-free linking. This requires a much more complex system.
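
A minimal Python sketch of the snowball effect described above. It is an illustration only, not how Elasticsearch or any particular linking system works. The helper names (`edit_distance`, `link`, `name_only`, `name_and_postcode`) and the sample postcodes are hypothetical. It links records transitively with a union-find structure, first on name similarity alone and then on name similarity plus a matching postcode.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using a two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or free if equal
            ))
        prev = curr
    return prev[-1]


def link(records, is_match):
    """Cluster records by transitive closure over pairwise matches."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(records)), 2):
        if is_match(records[i], records[j]):
            parent[find(i)] = find(j)

    clusters = {}
    for i, rec in enumerate(records):
        clusters.setdefault(find(i), []).append(rec)
    return list(clusters.values())


# Hypothetical sample data: (name, postcode). The postcodes are made up.
records = [
    ("Mark", "AB1 2CD"),
    ("Marc", "AB1 2CD"),
    ("Marcy", "ZZ9 9ZZ"),
    ("Macy", "ZZ9 9ZZ"),
]


def name_only(a, b):
    return edit_distance(a[0], b[0]) <= 1


def name_and_postcode(a, b):
    return a[1] == b[1] and edit_distance(a[0], b[0]) <= 1


by_name = link(records, name_only)             # everything chains together
by_context = link(records, name_and_postcode)  # the chain breaks at the postcode
```

With names alone, the four records collapse into a single entity even though "Mark" and "Macy" are two edits apart. Requiring the postcode to agree keeps "Mark"/"Marc" separate from "Marcy"/"Macy".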

Michael_Sander · May 2, 2017, 12:26am

Why couldn't it be set up such that, for every entity returned in a bucket, all entities within an edit distance of 1 are also returned? That way the bucket "Mark" would catch "Marc", but not "Marcy". However, the bucket "Marc" would also catch "Mark" and "Marcy".
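
A rough Python sketch of that idea, done as post-processing on the client side. It assumes the exact-match term counts have already been fetched, and the function names and the counts are hypothetical. Each bucket collects only its direct neighbours within one edit, so matches are not chained from one bucket to the next.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using a two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or free if equal
            ))
        prev = curr
    return prev[-1]


def fuzzy_buckets(counts, max_dist=1):
    """For each key, gather every key within max_dist edits (itself included)."""
    return {
        key: {
            other: n
            for other, n in counts.items()
            if edit_distance(key, other) <= max_dist
        }
        for key in counts
    }


# Hypothetical exact-match bucket counts, e.g. from a terms aggregation.
counts = {"Mark": 10, "Marc": 3, "Marcy": 2, "Macy": 1}
buckets = fuzzy_buckets(counts)

# Total document count per fuzzy bucket.
totals = {key: sum(members.values()) for key, members in buckets.items()}
```

Here the "Mark" bucket holds "Mark" and "Marc", while the "Marc" bucket holds "Mark", "Marc" and "Marcy". The buckets overlap, so a document can be counted in more than one of them, and every pair of keys is compared, which is where the cost at scale comes from.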

This sounds like a difficult system to build and run at scale, but perhaps possible.

Mark_Harwood · May 2, 2017, 7:06am

With an edit distance of 1, "Jon" would match "Joan" but not "Jonathon".

Name matching is a specialist topic and frequently backed by synonym databases (see http://www.basistech.com/text-analytics/rosette/name-indexer/ ).

My experience is that effective entity resolution relies on added context, e.g. ZIP codes or phone numbers, alongside names.

system · May 30, 2017, 7:21am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
