Using aggregation in Elasticsearch to find duplicate data in the same index

Hi everyone,

I'm trying to find duplicates within the same index. I tried to use an aggregation like this:

{
	"aggs": {
		"duplicates": {
			"terms": {
				"script": {
					"source": "if (doc['username.sort-field'].size() == 0 || doc['firstName.sort-field'].size() == 0 || doc['lastName.sort-field'].size() == 0 || doc['email.sort-field'].size() == 0) return null; doc['username.sort-field'].value + doc['firstName.sort-field'].value + doc['lastName.sort-field'].value +  doc['email.sort-field'].value",
					"lang": "painless"
				}
			},
			"aggs": {
				"duplicateDocuments": {
					"top_hits": {
						"size": 1,
						"_source": [
							"username",
							"firstName",
							"lastName",
							"email"
						]
					}
				}
			}
		}
	},
	"size": 0
}

Note: I'm giving an example with username and email; my model can have duplicates in all fields except for some IDs.

And it works fine; I'm getting this response from ES:

{
	"took": 3264,
	"timed_out": false,
	"_shards": {
		"total": 2,
		"successful": 2,
		"skipped": 0,
		"failed": 0
	},
	"hits": {
		"total": {
			"value": 10000,
			"relation": "gte"
		},
		"max_score": null,
		"hits": []
	},
	"aggregations": {
		"duplicates": {
			"doc_count_error_upper_bound": 179,
			"sum_other_doc_count": 676366,
			"buckets": [
				{
					"key": "key1...",
					"doc_count": 371,
					"duplicateDocuments": {
						"hits": {
							"total": {
								"value": 371,
								"relation": "eq"
							},
							"max_score": 1.0,
							"hits": [
								....
							]
						}
					}
				},
				{
					"key": "key2...",
					"doc_count": 353,
					"duplicateDocuments": {
						"hits": {
							"total": {
								"value": 353,
								"relation": "eq"
							},
							"max_score": 1.0,
							"hits": [
								....
							]
						}
					}
				}
				...
			]
		}
	}
}

I have two questions:

  1. Is there any other, easier way to find duplicates?
  2. If aggregation is the only way to find duplicates, is there a way to get an ES response like this?
{
	"took": 3264,
	"timed_out": false,
	"_shards": {
		"total": 2,
		"successful": 2,
		"skipped": 0,
		"failed": 0
	},
	"hits": {
		"total": {
			"value": 10000,
			"relation": "gte"
		},
		"max_score": null,
		"hits": []
	},
	"duplicates": [
		{
			"key": "key1...",
			"doc_count": 371,
			"duplicateDocuments": {
				"hits": {
					"total": {
						"value": 371,
						"relation": "eq"
					},
					"max_score": 1.0,
					"hits": [
						....
					]
				}
			}
		},
		{
			"key": "key2...",
			"doc_count": 353,
			"duplicateDocuments": {
				"hits": {
					"total": {
						"value": 353,
						"relation": "eq"
					},
					"max_score": 1.0,
					"hits": [
						....
					]
				}
			}
		}
		...
	]
}

Otherwise, I will need a lot of type checking and casting in Scala; it already takes result.aggregationsAsMap("duplicates").asInstanceOf[Map]("buckets") just to get to the buckets list.
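Something like this, just to sketch the kind of digging I mean (the helper name is mine, and the casts assume the untyped Map[String, Any] / Seq shape that aggregationsAsMap hands back):

// Sketch only: pull the duplicate buckets out of the untyped aggregation map.
// Adjust the casts to the concrete types your client actually returns.
def extractDuplicateBuckets(aggs: Map[String, Any]): Seq[(String, Long)] =
  aggs.get("duplicates").toSeq.flatMap { dup =>
    dup.asInstanceOf[Map[String, Any]]("buckets")
      .asInstanceOf[Seq[Map[String, Any]]]
      .map(b => b("key").toString -> b("doc_count").asInstanceOf[Number].longValue())
  }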

Thank you so much and happy holidays :slight_smile:

Hi @Amraneze, welcome to the community!
What is the objective of finding duplicates? Do you just want to find them, or do you plan to remove them?

Here are some examples of how to deduplicate data in Elasticsearch


@ylasri To find them and display them by key. Actually, I just gave an example with username, firstName, lastName and email, because email and username should be unique.

Try to use a transform.
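For example, something roughly like this (just a sketch: the transform id, the index names users / users_dedup, and the count field name are placeholders I picked, and I only grouped by two of your fields):

PUT _transform/dedup_users
{
  "source": { "index": "users" },
  "dest": { "index": "users_dedup" },
  "pivot": {
    "group_by": {
      "email": { "terms": { "field": "email" } },
      "username": { "terms": { "field": "username" } }
    },
    "aggregations": {
      "count": { "value_count": { "field": "email" } }
    }
  }
}

Then start it with POST _transform/dedup_users/_start and search the destination index for count > 1 to list the duplicate groups.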

@ylasri That would have a huge impact on the application's performance, especially when transforming millions of documents.

Transforms perform search aggregations on the source indices then index the results into the destination index. Therefore, a transform never takes less time or uses less resources than the aggregation and indexing processes.

Can you try this query? You will get only buckets with doc_count > 1:

GET emails/_search
{
  "size": 0,
  "aggs": {
    "duplicates": {
      "composite": {
        "sources": [
          { "email": { "terms": { "field": "email" } } },
          { "username": { "terms": { "field": "username" } } },
          { "first_name": { "terms": { "field": "first_name" } } },
          { "last_name": { "terms": { "field": "last_name" } } }
        ]
      },
      "aggs": {
          "filter": {
            "bucket_selector": {
              "buckets_path": {
                "doc_count": "_count"
              },
              "script": "params.doc_count > 1"
            }
         }
      }
    }
  }
}
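
One note in case it matters at your scale: the composite aggregation returns buckets in pages (10 by default), so to walk through everything you take the after_key from each response and pass it back as after in the next request. For example (the size and the after values are placeholders; after should be the after_key object from the previous page):

GET emails/_search
{
  "size": 0,
  "aggs": {
    "duplicates": {
      "composite": {
        "size": 1000,
        "sources": [
          { "email": { "terms": { "field": "email" } } },
          { "username": { "terms": { "field": "username" } } },
          { "first_name": { "terms": { "field": "first_name" } } },
          { "last_name": { "terms": { "field": "last_name" } } }
        ],
        "after": {
          "email": "last.email@example.com",
          "username": "last_username",
          "first_name": "Last",
          "last_name": "Key"
        }
      },
      "aggs": {
        "filter": {
          "bucket_selector": {
            "buckets_path": { "doc_count": "_count" },
            "script": "params.doc_count > 1"
          }
        }
      }
    }
  }
}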
