Hi everyone,
I'm trying to find duplicates within the same index. I tried to use aggregation like this :
{
"aggs": {
"duplicates": {
"terms": {
"script": {
"source": "if (doc['username.sort-field'].size() == 0 || doc['firstName.sort-field'].size() == 0 || doc['lastName.sort-field'].size() == 0 || doc['email.sort-field'].size() == 0) return null; doc['username.sort-field'].value + doc['firstName.sort-field'].value + doc['lastName.sort-field'].value + doc['email.sort-field'].value",
"lang": "painless"
}
},
"aggs": {
"duplicateDocuments": {
"top_hits": {
"size": 1,
"_source": [
"username",
"firstName",
"lastName",
"email"
]
}
}
}
}
},
"size": 0
}
Note: I'm giving an example with username and email, my model can have duplicates in all fields except the some ids.
And it works fine, I am getting this response from ES
{
"took": 3264,
"timed_out": false,
"_shards": {
"total": 2,
"successful": 2,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 10000,
"relation": "gte"
},
"max_score": null,
"hits": []
},
"aggregations": {
"duplicates": {
"doc_count_error_upper_bound": 179,
"sum_other_doc_count": 676366,
"buckets": [
{
"key": "key1...",
"doc_count": 371,
"duplicateDocuments": {
"hits": {
"total": {
"value": 371,
"relation": "eq"
},
"max_score": 1.0,
"hits": [
....
]
}
}
},
{
"key": "key2...",
"doc_count": 353,
"duplicateDocuments": {
"hits": {
"total": {
"value": 353,
"relation": "eq"
},
"max_score": 1.0,
"hits": [
....
]
}
}
}
...
]
}
}
}
I have two questions:
- Is there any other way to find duplicates in an easy way ?
- If the only way to find duplicates is with aggregation, is there a way to have a ES response like this ?
{
"took": 3264,
"timed_out": false,
"_shards": {
"total": 2,
"successful": 2,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 10000,
"relation": "gte"
},
"max_score": null,
"hits": []
},
"duplicates": [
{
"key": "key1...",
"doc_count": 371,
"duplicateDocuments": {
"hits": {
"total": {
"value": 371,
"relation": "eq"
},
"max_score": 1.0,
"hits": [
....
]
}
}
},
{
"key": "key2...",
"doc_count": 353,
"duplicateDocuments": {
"hits": {
"total": {
"value": 353,
"relation": "eq"
},
"max_score": 1.0,
"hits": [
....
]
}
}
}
...
]
}
Otherwise, I will need a lot of checking with Scala, just this code result.aggregationsAsMap("duplicates").asInstanceOf[Map]("buckets")
to get to buckets list
Thank you so much and happy holidays