Using machine learning and TF-IDF for record linkage, fuzzy grouping, and deduplication?

A large new department has been created by merging three departments. A few
employees worked for two or three of those departments before the merge,
which means the attributes of one person might be listed in different
departments' databases.
An additional problem is that one person can appear under different first
names or nicknames.

The attributes of a person include
first name, last name, email, home phone, cell phone, SSN, address, etc.

Because some of these values can be empty, there is no unique primary key.
Hence we need an intelligent solution for the classification, with weights
assigned to the different matching rules.

Any tips for handling such deduplication tasks? Are any open-source tools
available?

The database contains about 100 million records.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/68242e72-4aff-41a9-8a45-dc726e89aab8%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


I'd start with the more_like_this query and see how far that takes you.

clint
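As a rough illustration of what such a more_like_this query body could look like for this use case (my own sketch, not from the thread; the field names and the low frequency thresholds are assumptions about the person index):

```python
def build_mlt_query(record, fields=("first_name", "last_name", "address", "email")):
    """Build an ES 1.x more_like_this query body from a person record dict.

    Field names are hypothetical; like_text is assembled from whichever
    attributes the record actually has (many may be empty).
    """
    like_text = " ".join(str(record[f]) for f in fields if record.get(f))
    return {
        "query": {
            "more_like_this": {
                "fields": list(fields),
                "like_text": like_text,
                # Person fields are short, so keep the TF/DF thresholds low:
                "min_term_freq": 1,
                "min_doc_freq": 1,
            }
        }
    }

body = build_mlt_query({"first_name": "Bob", "last_name": "Smith", "email": ""})
# like_text becomes "Bob Smith"; the empty email is skipped.
```

The body would then be sent as the search request against the person index; the TF-IDF-based scoring of more_like_this is what surfaces near-duplicate records.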

On 17 March 2014 18:28, Shrin King aoirex@gmail.com wrote:



I can only describe the approach I use with ES for data cleaning, in my case
for author names in roughly 50 million academic journal articles.

  • key construction phase: define the information that makes up the
    identifying key. Candidates are last name, address, and home phone (the
    strong assumption being that a person has only one home)

  • create compact keys (minimal entropy): remove everything that is not
    required in the identifier. For example, encode last names with phonetic
    codes. The phonetic algorithm must match the native language; do not use
    English phonetics if you do not have English names. There is a phonetic
    analyzer plugin for ES that also includes German phonetic algorithms. Add
    numeric key information to the identifying key unencoded, as raw values
    (no dashes, no hyphens). The compact key should be usable in an ES term
    query.

  • index every person record into a raw data index, using the compact key in
    a not_analyzed field, along with all the other person attributes
    (analyzed or not; what matters is the JSON source of the information)

  • candidate creation phase: iterate with scan/scroll over the raw data
    index and fire queries that contain a term query on the compact key. The
    reason: term queries are much faster than fuzzy queries. The better the
    compact key is constructed, the more precise the results.

  • index all found candidate lists into a new candidate index for further
    analysis

  • iterate with scan/scroll over the candidate index. Apply further rules to
    each list to verify that all the found candidates describe the same
    person. Because the candidate list in a doc is relatively short,
    threshold algorithms like Levenshtein distance are fine here.
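The key-construction and verification steps above can be sketched in a few lines of Python (my own illustration, not Jörg's code: a much-simplified Soundex stands in for the phonetic analyzer plugin, an in-memory comparison stands in for the term query against the raw index, and names like compact_key are invented):

```python
def soundex(name):
    """Very simplified American Soundex -- stands in for the phonetic plugin."""
    table = {"bfpv": "1", "cgjkqsxz": "2", "dt": "3", "l": "4", "mn": "5", "r": "6"}
    digit = lambda ch: next((d for letters, d in table.items() if ch in letters), "")
    name = "".join(ch for ch in name.lower() if ch.isalpha())
    if not name:
        return ""
    out, prev = [], digit(name[0])
    for ch in name[1:]:
        d = digit(ch)
        if d and d != prev:       # drop vowels and collapse repeated codes
            out.append(d)
        prev = d
    return (name[0].upper() + "".join(out) + "000")[:4]

def compact_key(record):
    """Phonetic last name plus raw phone digits: low entropy, term-query ready."""
    phone = "".join(ch for ch in record.get("home_phone", "") if ch.isdigit())
    return soundex(record.get("last_name", "")) + "|" + phone

def levenshtein(a, b):
    """Plain edit distance, for the final per-list verification pass."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

# Two spellings of the same person land in the same candidate list:
records = [
    {"last_name": "Smith", "first_name": "Bob", "home_phone": "555-123-4567"},
    {"last_name": "Smyth", "first_name": "Bob", "home_phone": "(555) 123-4567"},
]
keys = [compact_key(r) for r in records]   # both "S530|5551234567"
assert keys[0] == keys[1]
assert levenshtein("Smith", "Smyth") == 1  # cheap to check inside a short list
```

In the real pipeline the compact key is stored in a not_analyzed field and candidates are fetched with a term query, so the expensive Levenshtein comparison only ever runs within the short candidate lists, never across all 100 million records.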

Jörg

On Mon, Mar 17, 2014 at 7:17 PM, Clinton Gormley clint@traveljury.com wrote:

