Hi, I am trying to find an efficient way to count unique contacts. I am using Elastic 8.15.
Index structure:
{
id, email, phone, first, last
}
Example data:
{
1, user@gmail.com, 1111111, Tom, Hanks
},
{
2, user2@gmail.com, 1111111, Tom, Hanks
},
{
3, user@gmail.com, 22222, Tom, Hanks
},
{
4, user4@gmail.com, 22222, Kate, Ragan
},
Each record shares a value with at least one other record, so there are 4 potential duplicate records here.
When I count using each field separately, I get:
email - 2
phone - 2
FullName - 3
The expected result is 4.
A contact can be identified by phone, or by email, or by first + last.
How could I calculate that? I would be grateful for any ideas.
Thanks
RabBit_BR (andre.coelho), November 19, 2024, 7:01pm
Hi @andreyshiryaev13
You can use a new field to create a unique identifier using properties of the document. Below, I used a processor to create fields that can be used as a unique key. I hope this helps as a starting point.
POST /_ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "set": {
          "field": "_source.unique_id_email_full_name",
          "value": "{{_source.email}}-{{_source.first}}-{{_source.last}}"
        }
      },
      {
        "set": {
          "field": "_source.unique_id_phone_full_name",
          "value": "{{_source.phone}}-{{_source.first}}-{{_source.last}}"
        }
      }
    ]
  },
  "docs": [
    {
      "_index": "index",
      "_id": "1",
      "_source": {
        "email": "user@gmail.com",
        "phone": "1111111",
        "first": "Tom",
        "last": "Hanks"
      }
    },
    {
      "_index": "index",
      "_id": "2",
      "_source": {
        "email": "user2@gmail.com",
        "phone": "1111111",
        "first": "Tom",
        "last": "Hanks"
      }
    },
    {
      "_index": "index",
      "_id": "3",
      "_source": {
        "email": "user@gmail.com",
        "phone": "22222",
        "first": "Tom",
        "last": "Hanks"
      }
    },
    {
      "_index": "index",
      "_id": "4",
      "_source": {
        "email": "user4@gmail.com",
        "phone": "22222",
        "first": "Kate",
        "last": "Ragan"
      }
    }
  ]
}
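To see what these two composite keys would look like on the sample documents without running the pipeline, here is a rough plain-Python emulation. It does not call Elasticsearch; the helper name `composite_key` is made up for illustration, and it only mimics the string concatenation done by the two `set` processors in the simulate request above.

```python
from collections import Counter

# Sample documents copied from the simulate request above.
docs = [
    {"id": 1, "email": "user@gmail.com", "phone": "1111111", "first": "Tom", "last": "Hanks"},
    {"id": 2, "email": "user2@gmail.com", "phone": "1111111", "first": "Tom", "last": "Hanks"},
    {"id": 3, "email": "user@gmail.com", "phone": "22222", "first": "Tom", "last": "Hanks"},
    {"id": 4, "email": "user4@gmail.com", "phone": "22222", "first": "Kate", "last": "Ragan"},
]


def composite_key(doc, fields):
    """Hypothetical helper: join the chosen fields with '-' like the set processor template."""
    return "-".join(str(doc[f]) for f in fields)


# How many documents fall under each composite key.
email_name = Counter(composite_key(d, ("email", "first", "last")) for d in docs)
phone_name = Counter(composite_key(d, ("phone", "first", "last")) for d in docs)

print(len(email_name), len(phone_name))  # prints: 3 3
```

Each key only matches documents that agree on all of its parts at once, so on this data each key yields 3 distinct values.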
Thank you for your answer, I will try that.
It looks like a multi-terms aggregation, but I need something different, not all 3 fields combined.
I have 3 criteria (phone or email or name), and each of them on its own can identify a duplicate row.
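One way to think about the "phone or email or name" rule is as a connected-components problem: two records belong to the same contact if they share any one of the three keys, directly or through a chain of other records. The sketch below is plain Python run on the four sample records outside Elasticsearch (for example on exported data), not an Elasticsearch feature; the names `find` and `group_contacts` are made up for illustration, and it uses a simple union-find structure.

```python
from collections import Counter

# The four sample records from the question.
records = [
    {"id": 1, "email": "user@gmail.com", "phone": "1111111", "first": "Tom", "last": "Hanks"},
    {"id": 2, "email": "user2@gmail.com", "phone": "1111111", "first": "Tom", "last": "Hanks"},
    {"id": 3, "email": "user@gmail.com", "phone": "22222", "first": "Tom", "last": "Hanks"},
    {"id": 4, "email": "user4@gmail.com", "phone": "22222", "first": "Kate", "last": "Ragan"},
]


def find(parent, x):
    """Return the root of x's group, shortening the path as we go."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def group_contacts(records):
    """Map each record id to a group root; records sharing ANY key end up together."""
    parent = {r["id"]: r["id"] for r in records}
    first_seen = {}  # (criterion, value) -> id of the first record carrying it
    for r in records:
        keys = [
            ("email", r["email"]),
            ("phone", r["phone"]),
            ("name", (r["first"], r["last"])),
        ]
        for key in keys:
            if key in first_seen:
                # Merge this record's group with the group that already has the key.
                a = find(parent, r["id"])
                b = find(parent, first_seen[key])
                parent[a] = b
            else:
                first_seen[key] = r["id"]
    return {r["id"]: find(parent, r["id"]) for r in records}


groups = group_contacts(records)
sizes = Counter(groups.values())
unique_contacts = len(sizes)
duplicate_records = sum(n for n in sizes.values() if n > 1)

print(unique_contacts, duplicate_records)  # prints: 1 4
```

On the sample data, records 1 and 2 share a phone, 1 and 3 share an email, and 3 and 4 share a phone, so all 4 records collapse into a single contact, which matches the expected result of 4 potential duplicates.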