Find and delete duplicate documents


(Noam Tenne) #1

Sorry if this has already been asked; I've mostly seen questions of how to deal with duplicate documents in the result set, but not how to actually locate and remove them from the index.

We have a type within an index that contains ~7 million documents.
Because this data was migrated from an earlier version, a subset of this type is duplicated; that is, the type contains an unknown number of documents with the same data and the same ID, so that using the REST API:

  1. Getting the document by ID returns a single document.
  2. Searching by document ID returns multiple documents.
  3. Searching by term returns multiple documents.

Assuming I've got no preliminary information about the duplicate documents other than their type, is there any way I can find and delete the duplicates while keeping only one copy?

The only solution I've got that seems viable is to dump the data from the external datasource and query each ID to check for duplicates, but I was hoping there might be a more straightforward way.

Cheers,
Noam


(Patrick Kik) #2

There is no straightforward solution.

You could scroll over all documents and query, for each and every document ID, whether there are any duplicates.
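A minimal sketch of that idea in Python. The hit shape and the `scroll_hits` list below are invented for illustration; in practice the hits would come from the scroll API, page by page:

```python
from collections import defaultdict

def find_duplicates(hits):
    """Group scrolled hits by their application-level id field and
    return the internal _ids of every copy beyond the first."""
    seen = defaultdict(list)
    for hit in hits:
        seen[hit["_source"]["id"]].append(hit["_id"])
    # keep the first copy of each id, mark the rest for deletion
    return [dup for ids in seen.values() for dup in ids[1:]]

# hypothetical scroll output
scroll_hits = [
    {"_id": "a1", "_source": {"id": 1}},
    {"_id": "a2", "_source": {"id": 1}},  # duplicate of id 1
    {"_id": "b1", "_source": {"id": 2}},
]
print(find_duplicates(scroll_hits))  # -> ['a2']
```

With ~7 million documents this means one full scroll pass, but it needs no extra queries if the whole grouping fits in memory.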


(Dominik Stadler) #3

See the description at https://qbox.io/blog/minimizing-document-duplication-in-elasticsearch, it suggests something like

curl -XGET 'http://localhost:9200/employeeid/info/_search?pretty=true' -d '{
  "size": 0,
  "aggs": {
    "duplicateCount": {
      "terms": {
        "field": "name",
        "min_doc_count": 2
      },
      "aggs": {
        "duplicateDocuments": {
          "top_hits": {}
        }
      }
    }
  }
}'

which should do what you are looking for.
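The response from that aggregation can then be walked to collect the duplicate document `_id`s. A sketch in Python, assuming the standard `terms`/`top_hits` response shape (the sample response fragment below is fabricated):

```python
def duplicate_ids(response):
    """Extract the _id of every hit in each bucket that matched
    min_doc_count: 2, i.e. every document that has duplicates."""
    ids = []
    for bucket in response["aggregations"]["duplicateCount"]["buckets"]:
        for hit in bucket["duplicateDocuments"]["hits"]["hits"]:
            ids.append(hit["_id"])
    return ids

# fabricated response fragment in the terms/top_hits shape
response = {
    "aggregations": {
        "duplicateCount": {
            "buckets": [
                {
                    "key": "john",
                    "doc_count": 2,
                    "duplicateDocuments": {
                        "hits": {"hits": [{"_id": "1"}, {"_id": "2"}]}
                    },
                }
            ]
        }
    }
}
print(duplicate_ids(response))  # -> ['1', '2']
```

Note that this lists every copy, including the one you want to keep, so you still have to decide per bucket which `_id` to spare before deleting.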


(Buchta) #4

Well done. But please tell me how to delete these documents. I have a query that gives me all the duplicate records I want to delete:
GET /sube-2016.08.18/commercial_act/_search
{
  "query": {
    "term": {
      "commercial_act_type_name": "New subscription - Porting into the OSK"
    }
  },
  "size": 0,
  "aggs": {
    "duplicated": {
      "terms": {
        "field": "id",
        "min_doc_count": 2
      },
      "aggs": {
        "duplicateDocuments": {
          "top_hits": {
            "sort": [
              {
                "@timestamp": {
                  "order": "desc"
                }
              }
            ],
            "fields": [
              "_id",
              "@timestamp"
            ],
            "size": 1
          }
        }
      }
    }
  }
}

but I don't know how to feed this query into delete-by-query.
Thanks.


(Steffen Winther Sørensen) #5

Depending on how many duplicates you have, you could search for the duplicate _ids and their index, then loop through them and issue a DELETE on each doc id, taking care to delete only one of each pair of duplicates so that a single copy remains.
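That loop can be sketched in Python as building `_bulk` delete actions: from each group of duplicates keep one copy and emit a delete action for the rest. The `groups` structure here is invented; in practice it would be assembled from the aggregation results discussed earlier in the thread:

```python
def bulk_delete_actions(groups):
    """Given {doc_id: [(index, _id), ...]} groups of duplicate copies,
    keep the first copy and build _bulk delete actions for the rest."""
    actions = []
    for copies in groups.values():
        for index, es_id in copies[1:]:  # skip the one copy we keep
            actions.append({"delete": {"_index": index, "_id": es_id}})
    return actions

# hypothetical duplicate groups keyed by the application-level id
groups = {
    "42": [("sube-2016.08.18", "a"), ("sube-2016.08.18", "b")],
}
print(bulk_delete_actions(groups))
# -> [{'delete': {'_index': 'sube-2016.08.18', '_id': 'b'}}]
```

Sending the actions through the bulk API instead of one DELETE per document keeps the round-trip count manageable at this scale.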


(Buchta) #6

Thank you. I expected something like a delete-by-query DSL example, but there is no way to use aggregation buckets as the input to a query.


(Nasrin) #7

I also have the same problem; please suggest a solution.
Also, is there any functionality like the "unique index" provided in MongoDB for maintaining unique data?


(Alex Marquardt) #9

I have written a blog post that describes how to deduplicate documents from Elasticsearch using either Logstash or with a custom Python script. This can be found at the following URL: https://alexmarquardt.com/2018/07/23/deduplicating-documents-in-elasticsearch/