Duplicate Deletion in Elasticsearch 2.X

I have an Elasticsearch cluster with close to a billion records (around 300 GB). The _id field was not set at ingestion time, and I recently discovered that there are duplicates in my data: their _id values differ, but all other fields are identical. Combining four of the attributes gives me a unique identifier, and that is what I will be using as the _id when ingesting data in the future. Is there a practical way of deleting the duplicates without reindexing?

There isn't any tool built specifically to help with deduplicating data. One option is to push an update to the old versions of the duplicated documents to mark them as stale, and then filter those out in your aggregations.
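For example, a minimal sketch of that marking step with the Python client, where is_duplicate is a made-up flag field and my_index/my_type are placeholders for your index and mapping type:

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://localhost:9200"])  # adjust host as needed

# _ids of the older copies of each duplicate, however you chose them
stale_ids = ["AVabc...", "AVdef..."]  # placeholders

actions = (
    {
        "_op_type": "update",
        "_index": "my_index",
        "_type": "my_type",  # mapping types are still required in 2.x
        "_id": doc_id,
        "doc": {"is_duplicate": True},  # partial update, merged into _source
    }
    for doc_id in stale_ids
)
helpers.bulk(es, actions)
```

Queries and aggregations would then exclude those documents with a must_not term filter on is_duplicate.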

To delete, you would use the delete API, and probably the bulk API to save resources (https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-delete.html).
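As a sketch of what that could look like with the Python client (my_index, my_type, and the _ids are placeholders; in 2.x each bulk action also needs the mapping type):

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://localhost:9200"])  # adjust host as needed

# For each group of duplicates, keep one _id and collect the rest here.
ids_to_delete = ["AVabc...", "AVdef..."]  # placeholders

actions = (
    {
        "_op_type": "delete",
        "_index": "my_index",
        "_type": "my_type",
        "_id": doc_id,
    }
    for doc_id in ids_to_delete
)

# helpers.bulk batches the deletes into _bulk requests instead of
# one HTTP round trip per document.
success, errors = helpers.bulk(es, actions, raise_on_error=False)
print("deleted:", success, "errors:", len(errors))
```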

You can use min_doc_count in a terms aggregation to find the duplicates, which I guess you have already done (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html). Depending on your requirements, you could also check whether the data differs outside of the four fields you use to identify duplicates, and then decide what to do.
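Something like the following could surface the duplicate groups. It's a sketch that assumes four placeholder fields (field_a..field_d) concatenated with a script into a single bucket key, which in 2.x requires inline (Groovy) scripting to be enabled:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # adjust host as needed

query = {
    "size": 0,
    "aggs": {
        "duplicate_keys": {
            "terms": {
                # Bucket on the concatenation of the four attributes
                # that together form the natural key.
                "script": {
                    "inline": "doc['field_a'].value + '|' + doc['field_b'].value"
                              " + '|' + doc['field_c'].value + '|' + doc['field_d'].value",
                    "lang": "groovy",
                },
                # Keep only buckets with more than one document, i.e. duplicates.
                "min_doc_count": 2,
                "size": 10000,
            },
            "aggs": {
                # Pull back the _ids in each bucket so you can pick which
                # copies to keep and which to delete.
                "duplicate_docs": {"top_hits": {"_source": False, "size": 100}}
            },
        }
    },
}

response = es.search(index="my_index", body=query)
for bucket in response["aggregations"]["duplicate_keys"]["buckets"]:
    ids = [hit["_id"] for hit in bucket["duplicate_docs"]["hits"]["hits"]]
    print(bucket["key"], ids)
```

Be aware that a scripted terms aggregation over a billion documents is expensive and can produce a huge number of buckets, so you may need to run it per time range or filter slice rather than across the whole index at once.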

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.