Find and delete duplicate documents


(Noam Tenne) #1

Sorry if this has already been asked; I've mostly seen questions of how to deal with duplicate documents in the result set, but not how to actually locate and remove them from the index.

We have a type within an index that contains ~7 million documents.
Because this data was migrated from an earlier version, a subset of this type is duplicated; that is, the type contains an unknown number of documents with the same data and the same ID, so that using the REST API:

  1. Getting the document by ID returns a single document.
  2. Searching by document ID returns multiple documents.
  3. Searching by term returns multiple documents.

Assuming I've got no preliminary information about the duplicate documents other than their type, is there any way I can find and delete the duplicates while keeping only one copy?

The only solution I've got that seems viable is to dump the data from the external datasource and query each ID to check for duplicates, but I was hoping there might be a more straightforward way.

Cheers,
Noam


(Patrick Kik) #2

There is no straightforward solution.

You could scroll over all documents and query, for each and every document ID, whether there are any duplicates.
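A minimal sketch of that idea in Python. The hit shape and the `scroll_hits` list below are invented for illustration; in practice the hits would come from the scroll API, page by page:

```python
from collections import defaultdict

def find_duplicates(hits):
    """Group scrolled hits by their application-level id field and
    return the internal _ids of every copy beyond the first."""
    seen = defaultdict(list)
    for hit in hits:
        seen[hit["_source"]["id"]].append(hit["_id"])
    # keep the first copy of each id, mark the rest for deletion
    return [dup for ids in seen.values() for dup in ids[1:]]

# hypothetical scroll output
scroll_hits = [
    {"_id": "a1", "_source": {"id": 1}},
    {"_id": "a2", "_source": {"id": 1}},  # duplicate of id 1
    {"_id": "b1", "_source": {"id": 2}},
]
print(find_duplicates(scroll_hits))  # -> ['a2']
```

With ~7 million documents this means one full scroll pass, but it needs no extra queries if the whole grouping fits in memory.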


(Dominik Stadler) #3

See the description at https://qbox.io/blog/minimizing-document-duplication-in-elasticsearch, it suggests something like

curl -XGET 'http://localhost:9200/employeeid/info/_search?pretty=true' -d '{
  "size": 0,
  "aggs": {
    "duplicateCount": {
      "terms": {
        "field": "name",
        "min_doc_count": 2
      },
      "aggs": {
        "duplicateDocuments": {
          "top_hits": {}
        }
      }
    }
  }
}'

which should do what you are looking for.
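The response from that aggregation can then be walked to collect the duplicate document `_id`s. A sketch in Python, assuming the standard `terms`/`top_hits` response shape (the sample response fragment below is fabricated):

```python
def duplicate_ids(response):
    """Extract the _id of every hit in each bucket that matched
    min_doc_count: 2, i.e. every document that has duplicates."""
    ids = []
    for bucket in response["aggregations"]["duplicateCount"]["buckets"]:
        for hit in bucket["duplicateDocuments"]["hits"]["hits"]:
            ids.append(hit["_id"])
    return ids

# fabricated response fragment in the terms/top_hits shape
response = {
    "aggregations": {
        "duplicateCount": {
            "buckets": [
                {
                    "key": "john",
                    "doc_count": 2,
                    "duplicateDocuments": {
                        "hits": {"hits": [{"_id": "1"}, {"_id": "2"}]}
                    },
                }
            ]
        }
    }
}
print(duplicate_ids(response))  # -> ['1', '2']
```

Note that this lists every copy, including the one you want to keep, so you still have to decide per bucket which `_id` to spare before deleting.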


(Buchta) #4

Well done. But please tell me how to delete these documents. I have a query that gives me all the duplicate records I want to delete:
GET /sube-2016.08.18/commercial_act/_search
{
  "query": {
    "term": {
      "commercial_act_type_name": "New subscription - Porting into the OSK"
    }
  },
  "size": 0,
  "aggs": {
    "duplicated": {
      "terms": {
        "field": "id",
        "min_doc_count": 2
      },
      "aggs": {
        "duplicateDocuments": {
          "top_hits": {
            "sort": [
              {
                "@timestamp": {
                  "order": "desc"
                }
              }
            ],
            "fields": [
              "_id",
              "@timestamp"
            ],
            "size": 1
          }
        }
      }
    }
  }
}

but I don't know how to feed this query into delete-by-query.
Thanks.


(Steffen Winther Sørensen) #5

Depending on how many duplicates you have, you could search for the duplicate _ids and their index, then loop through them and issue a DELETE on each doc id, taking care to delete only one of each pair of duplicates so that a single copy remains.
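That loop can be sketched in Python as building `_bulk` delete actions: from each group of duplicates keep one copy and emit a delete action for the rest. The `groups` structure here is invented; in practice it would be assembled from the aggregation results discussed earlier in the thread:

```python
def bulk_delete_actions(groups):
    """Given {doc_id: [(index, _id), ...]} groups of duplicate copies,
    keep the first copy and build _bulk delete actions for the rest."""
    actions = []
    for copies in groups.values():
        for index, es_id in copies[1:]:  # skip the one copy we keep
            actions.append({"delete": {"_index": index, "_id": es_id}})
    return actions

# hypothetical duplicate groups keyed by the application-level id
groups = {
    "42": [("sube-2016.08.18", "a"), ("sube-2016.08.18", "b")],
}
print(bulk_delete_actions(groups))
# -> [{'delete': {'_index': 'sube-2016.08.18', '_id': 'b'}}]
```

Sending the actions through the bulk API instead of one DELETE per document keeps the round-trip count manageable at this scale.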


(Buchta) #6

Thank you. I expected something like a delete-by-query DSL example, but there is no way to use aggregation buckets as the input to a query.


(Nasrin) #7

I also have the same problem; please suggest a solution.
Also, is there any functionality like the "unique index" provided in MongoDB for maintaining unique data?


(Alex Marquardt) #9

I have written a blog post that describes how to deduplicate documents from Elasticsearch using either Logstash or with a custom Python script. This can be found at the following URL: https://alexmarquardt.com/2018/07/23/deduplicating-documents-in-elasticsearch/