Using machine learning and TF-IDF for record linkage, fuzzy grouping, and deduplication?

A large new department has been created by merging three departments. A few
employees worked for two or three of those departments before the merge,
which means the attributes of one person might be listed in different
departments' databases.
An additional problem is that one person can appear under different first
names or nicknames.

The attributes of a person include
first name, last name, email, home phone, cell phone, SSN, address, etc.

Because some of these values can be empty, there is no unique primary key.
Hence we need an intelligent solution for the classification, with weights
assigned to the different matching rules.

Any tips for handling such deduplication tasks? Are any open-source tools
available?

The database contains about 100 million records.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/68242e72-4aff-41a9-8a45-dc726e89aab8%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


I'd start with the more_like_this query and see how far that takes you.

clint
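As a rough illustration of what such a more_like_this query body could look like for this use case (my own sketch, not from the thread; the field names and the low frequency thresholds are assumptions about the person index):

```python
def build_mlt_query(record, fields=("first_name", "last_name", "address", "email")):
    """Build an ES 1.x more_like_this query body from a person record dict.

    Field names are hypothetical; like_text is assembled from whichever
    attributes the record actually has (many may be empty).
    """
    like_text = " ".join(str(record[f]) for f in fields if record.get(f))
    return {
        "query": {
            "more_like_this": {
                "fields": list(fields),
                "like_text": like_text,
                # Person fields are short, so keep the TF/DF thresholds low:
                "min_term_freq": 1,
                "min_doc_freq": 1,
            }
        }
    }

body = build_mlt_query({"first_name": "Bob", "last_name": "Smith", "email": ""})
# like_text becomes "Bob Smith"; the empty email is skipped.
```

The body would then be sent as the search request against the person index; the TF-IDF-based scoring of more_like_this is what surfaces near-duplicate records.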

On 17 March 2014 18:28, Shrin King aoirex@gmail.com wrote:



I can only describe the approach I use with ES for data cleaning, in my case
for author names in roughly 50 million academic journal articles.

  • key construction phase: define the information that makes up the
    identifying key. Candidates are last name, address, and home phone (the
    strong assumption being that a person has only one home)

  • create compact keys (minimal entropy): remove everything that is not
    required in the identifier. For example, encode last names with phonetic
    codes. The phonetic algorithm must match the native language; do not use
    English phonetics if you do not have English names. There is a phonetic
    analyzer plugin for ES that also includes German phonetic algorithms. Add
    numeric key information to the identifying key unencoded, as raw values
    (no dashes, no hyphens). The compact key should be usable in an ES term
    query.

  • index every person record into a raw data index, using the compact key in
    a not_analyzed field, along with all the other person attributes
    (analyzed or not; what matters is the JSON source of the information)

  • candidate creation phase: iterate with scan/scroll over the raw data
    index and fire queries that contain a term query on the compact key. The
    reason: term queries are much faster than fuzzy queries. The better the
    compact key is constructed, the more precise the results.

  • index all found candidate lists into a new candidate index for further
    analysis

  • iterate with scan/scroll over the candidate index. Apply further rules to
    each list to verify that all the found candidates describe the same
    person. Because the candidate list in a doc is relatively short,
    threshold algorithms like Levenshtein distance are fine here.
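The key-construction and verification steps above can be sketched in a few lines of Python (my own illustration, not Jörg's code: a much-simplified Soundex stands in for the phonetic analyzer plugin, an in-memory comparison stands in for the term query against the raw index, and names like compact_key are invented):

```python
def soundex(name):
    """Very simplified American Soundex -- stands in for the phonetic plugin."""
    table = {"bfpv": "1", "cgjkqsxz": "2", "dt": "3", "l": "4", "mn": "5", "r": "6"}
    digit = lambda ch: next((d for letters, d in table.items() if ch in letters), "")
    name = "".join(ch for ch in name.lower() if ch.isalpha())
    if not name:
        return ""
    out, prev = [], digit(name[0])
    for ch in name[1:]:
        d = digit(ch)
        if d and d != prev:       # drop vowels and collapse repeated codes
            out.append(d)
        prev = d
    return (name[0].upper() + "".join(out) + "000")[:4]

def compact_key(record):
    """Phonetic last name plus raw phone digits: low entropy, term-query ready."""
    phone = "".join(ch for ch in record.get("home_phone", "") if ch.isdigit())
    return soundex(record.get("last_name", "")) + "|" + phone

def levenshtein(a, b):
    """Plain edit distance, for the final per-list verification pass."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

# Two spellings of the same person land in the same candidate list:
records = [
    {"last_name": "Smith", "first_name": "Bob", "home_phone": "555-123-4567"},
    {"last_name": "Smyth", "first_name": "Bob", "home_phone": "(555) 123-4567"},
]
keys = [compact_key(r) for r in records]   # both "S530|5551234567"
assert keys[0] == keys[1]
assert levenshtein("Smith", "Smyth") == 1  # cheap to check inside a short list
```

In the real pipeline the compact key is stored in a not_analyzed field and candidates are fetched with a term query, so the expensive Levenshtein comparison only ever runs within the short candidate lists, never across all 100 million records.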

Jörg

On Mon, Mar 17, 2014 at 7:17 PM, Clinton Gormley clint@traveljury.com wrote:

