Sorry if this has already been asked; I've mostly seen questions about how to deal with duplicate documents in the result set, but not how to actually locate and remove them from the index.
We have a type within an index that contains ~7 million documents.
Because this data was migrated from an earlier version, a subset of this type is duplicated; that is, the type contains an unknown number of documents with the same data and the same ID, so that when using the REST API:
Getting the document by ID returns a single document.
Searching by document ID returns multiple documents.
Searching by term returns multiple documents.
Assuming I've got no preliminary information about the duplicate documents other than their type, is there any way I can find and delete the duplicates while keeping only one copy?
The only solution I've got that seems viable is to dump the data from the external datasource and query each ID to check for duplicates, but I was hoping there might be a more straightforward way.
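For what it's worth, the check-each-ID approach described above could be sketched roughly as follows. This is only a sketch: `count_hits` is a hypothetical callable standing in for whatever search-by-ID request you would run against the index, and the IDs and hit counts are invented for illustration.

```python
from typing import Callable, Dict, Iterable


def find_duplicate_ids(
    ids: Iterable[str], count_hits: Callable[[str], int]
) -> Dict[str, int]:
    """Return {doc_id: hit_count} for every ID that matches more than one document."""
    duplicates = {}
    for doc_id in ids:
        hits = count_hits(doc_id)  # hypothetical: one search-by-ID per document
        if hits > 1:
            duplicates[doc_id] = hits
    return duplicates


# Stand-in for the real index: invented hit counts per ID.
fake_hit_counts = {"a": 1, "b": 3, "c": 2}
duplicates = find_duplicate_ids(fake_hit_counts, lambda doc_id: fake_hit_counts[doc_id])
print(duplicates)
```

With ~7 million documents this means one request per ID, which is why a single aggregation over the index would be preferable if it works.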
Well done. But please tell me how to delete these documents. I have a query that returns all the duplicate records I want to delete:
GET /sube-2016.08.18/commercial_act/_search
{
  "query": {
    "term": {
      "commercial_act_type_name": "New subscription - Porting into the OSK"
    }
  },
  "size": 0,
  "aggs": {
    "duplicated": {
      "terms": {
        "field": "id",
        "min_doc_count": 2
      },
      "aggs": {
        "duplicateDocuments": {
          "top_hits": {
            "sort": [
              {
                "@timestamp": {
                  "order": "desc"
                }
              }
            ],
            "fields": [
              "_id",
              "@timestamp"
            ],
            "size": 1
          }
        }
      }
    }
  }
}
but I don't know how to turn this query into a delete-by-query request.
Thanks.
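A search with aggregations only reports the duplicates; it does not delete anything by itself, so one option is to read the bucket list out of the response and act on it afterwards. Below is a minimal sketch of that first step. It assumes the usual terms-aggregation response shape (`aggregations.<name>.buckets`, each with `key` and `doc_count`) and the aggregation names from the query above; `sample_response` is invented data, not real output. Note that the `top_hits` sub-aggregation with `"order": "desc"` and `"size": 1` returns the newest copy, which is presumably the one to keep rather than delete, and that a terms aggregation returns only the top 10 buckets unless `size` is set.

```python
from typing import Any, Dict, List


def summarize_duplicates(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """List each duplicated ID with its copy count and the newest copy's location."""
    buckets = response["aggregations"]["duplicated"]["buckets"]
    summary = []
    for bucket in buckets:
        newest = bucket["duplicateDocuments"]["hits"]["hits"][0]
        summary.append(
            {
                "id": bucket["key"],
                "copies": bucket["doc_count"],
                "extra_copies": bucket["doc_count"] - 1,  # how many to remove
                "newest_index": newest["_index"],
            }
        )
    return summary


# Invented response fragment, shaped like a terms + top_hits result.
sample_response = {
    "aggregations": {
        "duplicated": {
            "buckets": [
                {
                    "key": "1001",
                    "doc_count": 3,
                    "duplicateDocuments": {
                        "hits": {"hits": [{"_index": "sube-2016.08.18", "_id": "1001"}]}
                    },
                }
            ]
        }
    }
}

summary = summarize_duplicates(sample_response)
print(summary)
```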
Depending on how many duplicates you have, search for the duplicated _id values and the index each one lives in, then loop through them and issue a DELETE on each document ID, since a DELETE by ID appears to remove only one of the duplicates at a time.
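The loop described above might look something like this. It is a sketch under the assumption stated in the post, namely that one DELETE by ID removes exactly one copy. `count_copies` and `delete_one` are hypothetical callables standing in for the real search and DELETE requests, and the in-memory `store` exists only to make the example runnable.

```python
from typing import Callable


def delete_extra_copies(
    doc_id: str,
    count_copies: Callable[[str], int],
    delete_one: Callable[[str], None],
    max_attempts: int = 100,
) -> int:
    """Delete copies of doc_id one at a time until a single copy remains.

    Returns the number of deletes issued.
    """
    deleted = 0
    while count_copies(doc_id) > 1:
        if deleted >= max_attempts:
            # Guard against looping forever if deletes are not taking effect.
            raise RuntimeError(f"gave up on {doc_id} after {deleted} deletes")
        delete_one(doc_id)  # assumption: removes exactly one copy
        deleted += 1
    return deleted


# In-memory stand-in for the index: ID -> number of copies (invented data).
store = {"1001": 3, "1002": 1}


def fake_delete(doc_id: str) -> None:
    store[doc_id] -= 1


removed = {doc_id: delete_extra_copies(doc_id, store.get, fake_delete) for doc_id in list(store)}
print(removed, store)
```

Against a real cluster, the count would only reflect a delete once the index has been refreshed, so a real `count_copies` would need to account for that.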
I have the same problem; please suggest a solution.
Also, is there any functionality like the "unique index" that MongoDB provides for keeping data unique?