Finding document similarity by distance and visualising clusters

sandydesu · March 29, 2022, 9:24am

I'm new to the stack and learning as I go along so any help would be much appreciated.

I have 9.5 million records of people, with formatted_name, formatted_address and formatted_birthday as indexable fields in Elasticsearch.

Due to the peculiarities of the data collection process (not mine), it is highly likely that there are duplicate records of people with minor differences between them. Examples might be a digit in the birthday field differing by one, or minor variations in name and/or address.

What I am trying to do is:

1. Set up a way to look up a person and cluster 'similar' people, with a score for that similarity (maybe something like edit distance, or something normalised to the range 0 to 1?).
2. Once I have that, I would like some way of visualising these clusters, using some kind of graph if possible.
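
To make task 1 concrete, here is a minimal Python sketch of the kind of score I have in mind. It is only an illustration under assumptions, not something Elasticsearch provides in this form: the unweighted average across the three fields, the 0.85 threshold, and the sample records are all made up, and the helper names (`similarity`, `record_similarity`, `cluster`) are hypothetical.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of inserts, deletes and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance normalised to 0..1 (1.0 means identical strings)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


FIELDS = ("formatted_name", "formatted_address", "formatted_birthday")


def record_similarity(r1: dict, r2: dict) -> float:
    """Unweighted mean of per-field similarities (the weighting is an assumption)."""
    return sum(similarity(r1[f], r2[f]) for f in FIELDS) / len(FIELDS)


def cluster(records: list, threshold: float = 0.85) -> list:
    """Group record indices linked by a score >= threshold (union-find)."""
    parent = list(range(len(records)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(records)), 2):
        if record_similarity(records[i], records[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(records)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


# Hypothetical sample data, only for illustration.
SAMPLE = [
    {"formatted_name": "John Smith", "formatted_address": "1 High Street",
     "formatted_birthday": "1980-01-01"},
    {"formatted_name": "Jon Smith", "formatted_address": "1 High Street",
     "formatted_birthday": "1980-01-02"},
    {"formatted_name": "Maria Garcia", "formatted_address": "42 Elm Road",
     "formatted_birthday": "1975-06-15"},
]
```

With this sample, `cluster(SAMPLE)` puts the first two records in one group and the third on its own. Comparing every pair like this is quadratic, so it would not scale to 9.5 million records as written; presumably a first step would have to narrow each record down to a small set of candidates before scoring.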

Task 1 is essential and task 2 is nice-to-have. I've been reading around and can see several potential approaches. Before I start experimenting, I was wondering if there is a best-practice approach to this problem, as it must be a fairly typical use case.

Thanks in advance.

system · April 26, 2022, 9:24am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
