Using aggregation in Elasticsearch to find duplicate data in the same index

Hi everyone,

I'm trying to find duplicates within the same index. I tried to use an aggregation like this:

{
	"aggs": {
		"duplicates": {
			"terms": {
				"script": {
					"source": "if (doc['username.sort-field'].size() == 0 || doc['firstName.sort-field'].size() == 0 || doc['lastName.sort-field'].size() == 0 || doc['email.sort-field'].size() == 0) return null; doc['username.sort-field'].value + doc['firstName.sort-field'].value + doc['lastName.sort-field'].value +  doc['email.sort-field'].value",
					"lang": "painless"
				}
			},
			"aggs": {
				"duplicateDocuments": {
					"top_hits": {
						"size": 1,
						"_source": [
							"username",
							"firstName",
							"lastName",
							"email"
						]
					}
				}
			}
		}
	},
	"size": 0
}

Note: I'm giving an example with username and email; my model can have duplicates in all fields except for some IDs.

And it works fine; I'm getting this response from ES:

{
	"took": 3264,
	"timed_out": false,
	"_shards": {
		"total": 2,
		"successful": 2,
		"skipped": 0,
		"failed": 0
	},
	"hits": {
		"total": {
			"value": 10000,
			"relation": "gte"
		},
		"max_score": null,
		"hits": []
	},
	"aggregations": {
		"duplicates": {
			"doc_count_error_upper_bound": 179,
			"sum_other_doc_count": 676366,
			"buckets": [
				{
					"key": "key1...",
					"doc_count": 371,
					"duplicateDocuments": {
						"hits": {
							"total": {
								"value": 371,
								"relation": "eq"
							},
							"max_score": 1.0,
							"hits": [
								....
							]
						}
					}
				},
				{
					"key": "key2...",
					"doc_count": 353,
					"duplicateDocuments": {
						"hits": {
							"total": {
								"value": 353,
								"relation": "eq"
							},
							"max_score": 1.0,
							"hits": [
								....
							]
						}
					}
				}
				...
			]
		}
	}
}

I have two questions:

  1. Is there any other, easier way to find duplicates?
  2. If aggregation is the only way to find duplicates, is there a way to get an ES response like this?
{
	"took": 3264,
	"timed_out": false,
	"_shards": {
		"total": 2,
		"successful": 2,
		"skipped": 0,
		"failed": 0
	},
	"hits": {
		"total": {
			"value": 10000,
			"relation": "gte"
		},
		"max_score": null,
		"hits": []
	},
	"duplicates": [
		{
			"key": "key1...",
			"doc_count": 371,
			"duplicateDocuments": {
				"hits": {
					"total": {
						"value": 371,
						"relation": "eq"
					},
					"max_score": 1.0,
					"hits": [
						....
					]
				}
			}
		},
		{
			"key": "key2...",
			"doc_count": 353,
			"duplicateDocuments": {
				"hits": {
					"total": {
						"value": 353,
						"relation": "eq"
					},
					"max_score": 1.0,
					"hits": [
						....
					]
				}
			}
		}
		...
	]
}

Otherwise, I will need a lot of type checking and casting in Scala; it already takes result.aggregationsAsMap("duplicates").asInstanceOf[Map]("buckets") just to get to the buckets list.
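Something like this, just to sketch the kind of digging I mean (the helper name is mine, and the casts assume the untyped Map[String, Any] / Seq shape that aggregationsAsMap hands back):

// Sketch only: pull the duplicate buckets out of the untyped aggregation map.
// Adjust the casts to the concrete types your client actually returns.
def extractDuplicateBuckets(aggs: Map[String, Any]): Seq[(String, Long)] =
  aggs.get("duplicates").toSeq.flatMap { dup =>
    dup.asInstanceOf[Map[String, Any]]("buckets")
      .asInstanceOf[Seq[Map[String, Any]]]
      .map(b => b("key").toString -> b("doc_count").asInstanceOf[Number].longValue())
  }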

Thank you so much and happy holidays :slight_smile:

Hi @Amraneze, welcome to the community!
What is the objective of finding duplicates? Do you just want to find them, or do you plan to remove them?

Here are some examples of how to deduplicate data in Elasticsearch


@ylasri To find them and display them by key. Actually, I just gave an example with username, firstName, lastName and email, because email and username should be unique.

Try to use a transform.
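For example, something roughly like this (just a sketch: the transform id, the index names users / users_dedup, and the count field name are placeholders I picked, and I only grouped by two of your fields):

PUT _transform/dedup_users
{
  "source": { "index": "users" },
  "dest": { "index": "users_dedup" },
  "pivot": {
    "group_by": {
      "email": { "terms": { "field": "email" } },
      "username": { "terms": { "field": "username" } }
    },
    "aggregations": {
      "count": { "value_count": { "field": "email" } }
    }
  }
}

Then start it with POST _transform/dedup_users/_start and search the destination index for count > 1 to list the duplicate groups.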

@ylasri That would have a huge impact on the application's performance, especially when transforming millions of documents.

Transforms perform search aggregations on the source indices then index the results into the destination index. Therefore, a transform never takes less time or uses less resources than the aggregation and indexing processes.

Can you try this query? You will get only buckets with doc_count > 1:

GET emails/_search
{
  "size": 0,
  "aggs": {
    "duplicates": {
      "composite": {
        "sources": [
          { "email": { "terms": { "field": "email" } } },
          { "username": { "terms": { "field": "username" } } },
          { "first_name": { "terms": { "field": "first_name" } } },
          { "last_name": { "terms": { "field": "last_name" } } }
        ]
      },
      "aggs": {
          "filter": {
            "bucket_selector": {
              "buckets_path": {
                "doc_count": "_count"
              },
              "script": "params.doc_count > 1"
            }
         }
      }
    }
  }
}
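
One note in case it matters at your scale: the composite aggregation returns buckets in pages (10 by default), so to walk through everything you take the after_key from each response and pass it back as after in the next request. For example (the size and the after values are placeholders; after should be the after_key object from the previous page):

GET emails/_search
{
  "size": 0,
  "aggs": {
    "duplicates": {
      "composite": {
        "size": 1000,
        "sources": [
          { "email": { "terms": { "field": "email" } } },
          { "username": { "terms": { "field": "username" } } },
          { "first_name": { "terms": { "field": "first_name" } } },
          { "last_name": { "terms": { "field": "last_name" } } }
        ],
        "after": {
          "email": "last.email@example.com",
          "username": "last_username",
          "first_name": "Last",
          "last_name": "Key"
        }
      },
      "aggs": {
        "filter": {
          "bucket_selector": {
            "buckets_path": { "doc_count": "_count" },
            "script": "params.doc_count > 1"
          }
        }
      }
    }
  }
}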
