How to calculate unique contact duplicates using different fields

andreyshiryaev13 · November 18, 2024, 7:24pm

Hi, I am trying to find an efficient way to count unique contacts in Elasticsearch 8.15.
Index structure:

{
id, email, phone, first, last
}

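Example data:

{ "id": 1, "email": "user@gmail.com", "phone": "1111111", "first": "Tom", "last": "Hanks" },
{ "id": 2, "email": "user2@gmail.com", "phone": "1111111", "first": "Tom", "last": "Hanks" },
{ "id": 3, "email": "user@gmail.com", "phone": "22222", "first": "Tom", "last": "Hanks" },
{ "id": 4, "email": "user4@gmail.com", "phone": "22222", "first": "Kate", "last": "Ragan" }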

Each record shares at least one value with another record, so there are 4 potential duplicate records here.

When I count using each field separately, I get:
email - 2
phone - 2
FullName - 3
The expected result is 4.

A contact can be identified by phone, or by email, or by first + last name.

What approaches could I use to calculate that? I would be grateful for any ideas.
Thanks

RabBit_BR · November 19, 2024, 7:01pm

Hi @andreyshiryaev13

You can add a new field that serves as a unique identifier built from properties of the document. Below, I used set processors in a simulated ingest pipeline to create fields that can be used as a unique key. I hope this helps as a starting point.

POST /_ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "set": {
          "field": "_source.unique_id_email_full_name",
          "value": "{{_source.email}}-{{_source.first}}-{{_source.last}}"
        }
      },
      {
        "set": {
          "field": "_source.unique_id_phone_full_name",
          "value": "{{_source.phone}}-{{_source.first}}-{{_source.last}}"
        }
      }
    ]
  },
  "docs": [
    {
      "_index": "index",
      "_id": "1",
      "_source": {
        "email": "user@gmail.com",
        "phone": "1111111",
        "first": "Tom",
        "last": "Hanks"
      }
    },
    {
      "_index": "index",
      "_id": "2",
      "_source": {
        "email": "user2@gmail.com",
        "phone": "1111111",
        "first": "Tom",
        "last": "Hanks"
      }
    },
    {
      "_index": "index",
      "_id": "3",
      "_source": {
        "email": "user@gmail.com",
        "phone": "22222",
        "first": "Tom",
        "last": "Hanks"
      }
    },
    {
      "_index": "index",
      "_id": "4",
      "_source": {
        "email": "user4@gmail.com",
        "phone": "22222",
        "first": "Kate",
        "last": "Ragan"
      }
    }
  ]
}
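
To see what these composite keys would look like for the four sample documents, here is a minimal Python sketch that rebuilds the same strings locally. It does not call Elasticsearch; the f-strings only mimic the two templates in the set processors above, and the variable names are made up for illustration.

```python
# Sample documents from the simulate request above.
docs = [
    {"email": "user@gmail.com", "phone": "1111111", "first": "Tom", "last": "Hanks"},
    {"email": "user2@gmail.com", "phone": "1111111", "first": "Tom", "last": "Hanks"},
    {"email": "user@gmail.com", "phone": "22222", "first": "Tom", "last": "Hanks"},
    {"email": "user4@gmail.com", "phone": "22222", "first": "Kate", "last": "Ragan"},
]

# Mimic "{{_source.email}}-{{_source.first}}-{{_source.last}}".
email_name_keys = {f"{d['email']}-{d['first']}-{d['last']}" for d in docs}

# Mimic "{{_source.phone}}-{{_source.first}}-{{_source.last}}".
phone_name_keys = {f"{d['phone']}-{d['first']}-{d['last']}" for d in docs}

print(len(email_name_keys), sorted(email_name_keys))
print(len(phone_name_keys), sorted(phone_name_keys))
```

Each key set ends up with 3 distinct values for this data. Note that a composite key only matches when all of its parts are equal, so document 4 (Kate Ragan) never shares a key with document 3 even though they have the same phone.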

andreyshiryaev13 · November 19, 2024, 9:00pm

Thank you for your answer, I will try that.

It looks like a multi-terms aggregation, but I need something different: a match should not require all 3 fields.

I have 3 criteria (phone or email or name), and any one of them alone can mark a row as a duplicate.
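
One way to read the "phone or email or name" rule is as a grouping problem: two records belong to the same contact if they share any one of the three keys, directly or through a chain of other records. The sketch below illustrates that idea client-side with a small union-find in plain Python. It is an assumption about the intended semantics, not an Elasticsearch feature, and the helper names (match_keys, group_duplicates) are hypothetical.

```python
from collections import defaultdict

# Sample records from the question.
contacts = [
    {"id": 1, "email": "user@gmail.com", "phone": "1111111", "first": "Tom", "last": "Hanks"},
    {"id": 2, "email": "user2@gmail.com", "phone": "1111111", "first": "Tom", "last": "Hanks"},
    {"id": 3, "email": "user@gmail.com", "phone": "22222", "first": "Tom", "last": "Hanks"},
    {"id": 4, "email": "user4@gmail.com", "phone": "22222", "first": "Kate", "last": "Ragan"},
]


def match_keys(doc):
    """Keys under which a record may match another one (any single key is enough)."""
    return [
        ("email", doc["email"]),
        ("phone", doc["phone"]),
        ("name", (doc["first"], doc["last"])),
    ]


def group_duplicates(docs):
    """Group record ids that are linked by at least one shared key (union-find)."""
    parent = {doc["id"]: doc["id"] for doc in docs}

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_seen = {}  # key -> id of the first record that carried it
    for doc in docs:
        for key in match_keys(doc):
            if key in first_seen:
                union(first_seen[key], doc["id"])
            else:
                first_seen[key] = doc["id"]

    groups = defaultdict(list)
    for doc in docs:
        groups[find(doc["id"])].append(doc["id"])
    return list(groups.values())


groups = group_duplicates(contacts)
# Records that sit in a group with at least one other record.
duplicate_records = sum(len(g) for g in groups if len(g) > 1)
print(groups, duplicate_records)
```

For the sample data all four records fall into a single group, which matches the expected result of 4 potential duplicates. For a large index the records would have to be streamed out of Elasticsearch first (for example with a scroll or point-in-time search), so this is only a starting point.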

system · December 17, 2024, 9:00pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
