I'm new to the stack and learning as I go along so any help would be much appreciated.
I have 9.5 million records of people, with formatted_name, formatted_address and formatted_birthday as indexable fields in Elasticsearch.
Due to the peculiarities of the data collection process (which I did not design), it is highly likely there are duplicate records of people, but with minor differences in each record. Examples might be a digit in the birthday field differing by one, or minor variations in name and/or address.
What I am trying to do is:
1. Set up a way to list a person and cluster 'similar' people with a score of that similarity (maybe something like Edit distance or something normalised to a range 0 - 1?).
2. Once I have that I would like some way of visualising these clusters using some kind of graph if possible.
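
To make task 1 more concrete, here is a rough Python sketch of the kind of score I have in mind: a plain Levenshtein edit distance per field, normalised to the range 0 to 1 and averaged over the three fields. The function names and the equal weighting of fields are my own assumptions for illustration, not something taken from Elasticsearch or from an existing library.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character inserts, deletes and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string as the inner loop to save memory
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance normalised to 0..1, where 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(a, b) / longest


# Field names as indexed in my data.
FIELDS = ("formatted_name", "formatted_address", "formatted_birthday")


def record_similarity(r1: dict, r2: dict) -> float:
    """Unweighted mean of the per-field similarities (equal weights are an assumption)."""
    return sum(similarity(r1[f], r2[f]) for f in FIELDS) / len(FIELDS)
```

Comparing every pair this way is clearly not feasible for 9.5 million records, so I assume something like this would only be used to score a small set of candidates that a search has already returned for a given person.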
Task 1 is essential and task 2 is nice-to-have. I've been reading around and can see several potential approaches. Before I start experimenting, I was wondering whether there is a best-practice approach to this problem, as it must be a fairly typical use case?
Thanks in advance.