I am not aware of any way to do fuzzy aggregations, but it would be a great feature. My database deals with names, and misspellings are rampant. It would be great if "Brian Smith" were put into the same bucket as "Brain Smith".
If this is not possible today, is it anywhere on the roadmap? Is it even technically feasible?
I've built data linking systems before, and a general issue with using any fuzzy matching on large sets (such as aggregations) is that the iterative nature of linking the data means that little errors get amplified, like feedback noise between a PA and a mic. 'Mark' might match 'Marc', which might then join with 'Marcy' and then 'Macy'. Entities just snowball.
Each name needs additional context to prevent this: a postcode, a date of birth, a vehicle registration plate. Names on their own are not enough, and entities need to hold a rich mix of identifiers for error-free linking. This requires a much more complex system.
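To make the snowballing concrete, here is a minimal Python sketch, not anything Elasticsearch provides. It links records transitively whenever two names are within a plain Levenshtein distance of 1, then repeats the run with an extra rule that the records must also share a postcode. The `link` function, the `require_shared_postcode` flag, and the sample postcodes are all made up for illustration, and exact postcode equality stands in for the richer mix of identifiers a real system would need.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance: inserts, deletes, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def link(records, max_distance=1, require_shared_postcode=False):
    """Merge (name, postcode) records transitively using union-find.

    Hypothetical helper: two records are linked when their names are
    within max_distance edits and, optionally, their postcodes match.
    """
    parent = list(range(len(records)))

    def find(i):
        # Walk up to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, (name_a, postcode_a) in enumerate(records):
        for j in range(i + 1, len(records)):
            name_b, postcode_b = records[j]
            if edit_distance(name_a, name_b) > max_distance:
                continue
            if require_shared_postcode and postcode_a != postcode_b:
                continue
            parent[find(i)] = find(j)

    clusters = {}
    for i, (name, _) in enumerate(records):
        clusters.setdefault(find(i), set()).add(name)
    return sorted(sorted(cluster) for cluster in clusters.values())


# Invented sample data: the postcodes are placeholders.
records = [
    ("Mark", "AB1 2CD"),
    ("Marc", "AB1 2CD"),
    ("Marcy", "ZZ9 9ZZ"),
    ("Macy", "ZZ9 9ZZ"),
]

names_only = link(records)
with_context = link(records, require_shared_postcode=True)
print(names_only)    # one snowballed entity
print(with_context)  # two separate entities
```

With names alone, all four records collapse into one entity even though 'Mark' and 'Macy' are two edits apart. Requiring a shared postcode breaks the chain at 'Marc'/'Marcy'.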
Why couldn't it be set up such that, for every entity returned in a bucket, all entities within an edit distance of 1 are also returned? That way the bucket "Mark" would catch "Marc" but not "Marcy". However, the bucket "Marc" would catch both "Mark" and "Marcy".
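A rough Python sketch of that per-bucket idea follows, purely as an illustration and not an existing aggregation. The post does not say which edit distance is meant, so this assumes one that counts an adjacent transposition as a single edit (optimal string alignment), which puts "Brian Smith" and "Brain Smith" within distance 1. The `osa_distance` and `fuzzy_buckets` names are hypothetical.

```python
from collections import Counter


def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: Levenshtein edits plus
    adjacent transpositions, each counted as one edit."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # delete
                          d[i][j - 1] + 1,         # insert
                          d[i - 1][j - 1] + cost)  # substitute or match
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transpose
    return d[-1][-1]


def fuzzy_buckets(names, max_distance=1):
    """For each distinct name, collect the counts of every distinct
    name within max_distance edits of it (including itself).

    Buckets overlap and are not merged, so there is no chaining."""
    counts = Counter(names)
    return {
        key: {other: n for other, n in counts.items()
              if osa_distance(key, other) <= max_distance}
        for key in counts
    }


buckets = fuzzy_buckets(["Mark", "Marc", "Marcy", "Macy", "Mark"])
for key, members in buckets.items():
    print(key, members)
```

Because each bucket is built around its own key and never merged with a neighbour, "Mark" picks up "Marc" but not "Marcy", while "Marc" picks up both. The cost is that buckets overlap, so counts no longer sum to the number of documents, and this naive version compares every pair of distinct names.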
This sounds like a difficult system to build and run at scale, but perhaps possible.