We have a new, large department that was merged from three former departments. A few employees worked for two or three of those departments before the merge, which means the attributes of one person may be listed in several departments' databases.
An additional problem is that one person can appear under different first names or nicknames.
The attributes of a person include first name, last name, email, home phone, cell phone, SSN, address, etc.
Because some of these values can be empty, there is no unique primary key.
Hence, we need an intelligent solution for the classification, with weights for the different matching rules.
Any tips on handling such deduplication tasks? Are there any open-source tools available to use?
I can only describe the approach I use with ES for data cleaning, in my case for author names in ~50 million academic journal articles.
Key construction phase: define the information that makes up the identifying key. Candidates are last name, address, and home phone (the strong assumption being that a person has only one home).
Create compact keys (minimal entropy): remove everything that is not required in the identifier. For example, encode last names with phonetic codes. The phonetic code must match the native language; do not use an English phonetic algorithm if you do not have English names. There is a phonetic analyzer plugin for ES that also contains German phonetic algorithms. Add numeric key information to the identifying key unencoded, as raw values (no dashes, no hyphens). The compact key should be usable in an ES term query.
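A minimal sketch of such a key builder in Python, assuming the jellyfish package for the phonetic code. The field names and the underscore separator are my own choices, and Soundex here is only a stand-in; for German names you would use a Cologne-phonetics style algorithm instead, as described above.

    import re
    import jellyfish

    def compact_key(record):
        parts = []
        if record.get("last_name"):
            # phonetic code collapses spelling variants of the last name
            parts.append(jellyfish.soundex(record["last_name"]))
        if record.get("home_phone"):
            # keep digits only: no dashes, no hyphens, no spaces
            parts.append(re.sub(r"\D", "", record["home_phone"]))
        if record.get("address"):
            # crude normalization; a real pipeline would canonicalize addresses
            parts.append(re.sub(r"\W", "", record["address"]).lower())
        return "_".join(parts)

    print(compact_key({"last_name": "Meyer", "home_phone": "030-123 456"}))
    # -> M600_030123456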
Index every person record into a raw data index, with the compact key in a not_analyzed field and all the other person attributes alongside it (analyzed or not; what matters is that the JSON source of the information is preserved).
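A rough sketch of this indexing step with the elasticsearch-py client. The index, type, and field names ("rawdata", "person", "compact_key") are illustrative, the not_analyzed string mapping matches older ES versions (newer ones use a "keyword" field), and compact_key() is the helper from the sketch above.

    from elasticsearch import Elasticsearch

    es = Elasticsearch()

    es.indices.create(index="rawdata", body={
        "mappings": {
            "person": {
                "properties": {
                    "compact_key": {"type": "string", "index": "not_analyzed"},
                    "first_name":  {"type": "string"},
                    "last_name":   {"type": "string"},
                    "email":       {"type": "string"}
                }
            }
        }
    })

    records = [
        {"first_name": "Hans", "last_name": "Meyer", "home_phone": "030-123 456"},
        {"first_name": "H.",   "last_name": "Meier", "home_phone": "030 123456"},
    ]
    for record in records:
        # both example records end up with the same compact key
        doc = dict(record, compact_key=compact_key(record))
        es.index(index="rawdata", doc_type="person", body=doc)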
Candidate creation phase: iterate with scan/scroll over the raw data index and fire queries that contain a term query on the compact key. The reason is that term queries are much faster than fuzzy queries. The better the compact key is constructed, the more precise the results.
Index all the candidate lists found this way into a new candidate index for further analysis.
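A sketch of the candidate creation phase with elasticsearch-py, where helpers.scan wraps the scan/scroll API and the term query runs against the not_analyzed compact key. The "rawdata" and "candidates" index names are assumptions, and a real run would deduplicate so each compact key is stored only once.

    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch()

    # scan/scroll over the raw data index
    for hit in helpers.scan(es, index="rawdata", query={"query": {"match_all": {}}}):
        key = hit["_source"]["compact_key"]
        # term query on the compact key: fast and exact, no fuzzy matching needed
        result = es.search(index="rawdata", body={
            "query": {"term": {"compact_key": key}},
            "size": 100
        })
        candidates = [h["_source"] for h in result["hits"]["hits"]]
        if len(candidates) > 1:
            # store each candidate list as a single document for the verification step
            es.index(index="candidates", doc_type="candidate_list", body={
                "compact_key": key,
                "candidates": candidates
            })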
Iterate with scan/scroll over the candidate index and apply additional rules to each list to verify that all the candidates found really describe the same person. Because the candidate list in a doc is relatively short, threshold algorithms like Levenshtein distance work fine here.
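A sketch of this verification phase: scan the candidate index and apply a Levenshtein threshold to the first names, so lists that only matched on the phonetic compact key are confirmed or rejected. The index and field names are the same assumptions as above, and the threshold of 2 is arbitrary.

    from elasticsearch import Elasticsearch, helpers

    def levenshtein(a, b):
        # classic dynamic-programming edit distance
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    es = Elasticsearch()

    for hit in helpers.scan(es, index="candidates", query={"query": {"match_all": {}}}):
        cands = hit["_source"]["candidates"]
        names = [c.get("first_name", "") for c in cands]
        # candidate lists are short, so pairwise comparison is cheap
        same_person = all(levenshtein(names[0], n) <= 2 for n in names[1:])
        print(hit["_source"]["compact_key"], "same person:", same_person)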