I have an Elasticsearch cluster with close to a billion records (around 300 GB). The _id field was not set at ingestion time, and I recently discovered that there are some duplicates in my data: their _id values differ, but all the other fields are the same. Combining four of the attributes gives me a unique identifier, and that is what I will be using as the _id when ingesting data in the future. Is there a practical way of deleting the duplicates without reindexing?
There isn't any tool built specifically to help with deduplicating data.
One option is to push an update to the older copies of the duplicated documents to mark them as stale, and then filter those out in your aggregations; see the sketch below.
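Roughly like this, as a sketch you could run from Kibana Dev Tools (the index name `my-index`, the flag field `is_duplicate`, the document IDs, and the `field_a` aggregation are all made-up placeholders):

```
# Flag the older copies via the bulk API (IDs are placeholders
# you'd collect from a duplicate-finding search)
POST /my-index/_bulk
{ "update": { "_id": "OLD_ID_1" } }
{ "doc": { "is_duplicate": true } }
{ "update": { "_id": "OLD_ID_2" } }
{ "doc": { "is_duplicate": true } }

# Aggregate while filtering the flagged copies out; documents
# that were never flagged still match the must_not clause
POST /my-index/_search
{
  "size": 0,
  "query": {
    "bool": {
      "must_not": { "term": { "is_duplicate": true } }
    }
  },
  "aggs": {
    "by_field_a": {
      "terms": { "field": "field_a" }
    }
  }
}
```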
To delete, you would use the delete API, probably via the bulk API to save resources (https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-delete.html).
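A minimal bulk-delete sketch (again, the index name and IDs are placeholders you'd fill in from the duplicate search):

```
# Each delete action is one NDJSON line; no document body needed
POST /my-index/_bulk
{ "delete": { "_id": "DUPLICATE_ID_1" } }
{ "delete": { "_id": "DUPLICATE_ID_2" } }
```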
You can use min_doc_count on a terms aggregation to find the duplicates, which I guess you have already achieved (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html). Depending on your requirements, you could also check whether the documents differ outside of the four fields you use to identify duplicates, and then decide what to do.
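One way to do that is a scripted terms aggregation over the combined key, something like the sketch below (field names `field_a` through `field_d` are placeholders and assumed to be keyword fields; the top_hits sub-aggregation returns the _ids in each group so you can pick which copies to update or delete):

```
POST /my-index/_search
{
  "size": 0,
  "aggs": {
    "duplicate_keys": {
      "terms": {
        "script": {
          "source": "doc['field_a'].value + '|' + doc['field_b'].value + '|' + doc['field_c'].value + '|' + doc['field_d'].value",
          "lang": "painless"
        },
        "min_doc_count": 2,
        "size": 1000
      },
      "aggs": {
        "duplicate_docs": {
          "top_hits": { "size": 10, "_source": false }
        }
      }
    }
  }
}
```

At your scale (around a billion documents) the scripted key is expensive, so you may need to run this in slices, e.g. filtered by a date range, rather than across the whole index at once.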
This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.