Sorry if this has already been asked; I've mostly seen questions about how to deal with duplicate documents in the result set, but not about how to actually locate and remove them from the index.
We have a type within an index that contains ~7 million documents.
Because this data was migrated from an earlier version, a subset of this type is duplicated; that is, the type contains an unknown number of documents with the same data and the same ID, so that via the REST API:
- Getting the document by ID returns a single document.
- Searching by document ID returns multiple documents.
- Searching by term returns multiple documents.
Assuming I have no prior information about the duplicate documents other than their type, is there any way to find and delete the duplicates while keeping exactly one copy of each?
The only viable solution I've come up with is to dump the IDs from the external data source and query each one to check for duplicates, but I was hoping there might be a more straightforward way.
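For reference, the duplicate check itself is trivial once the IDs are collected; here's a minimal sketch of the counting step, where `ids_from_scroll` is a hypothetical stand-in for the `_id` values I'd collect by scanning the type (the real list would come from the index or the external datasource):

```python
from collections import Counter

def find_duplicate_ids(hit_ids):
    """Return the set of document IDs that occur more than once.

    hit_ids is assumed to be a flat list of _id values gathered by
    scanning over the whole type -- a hypothetical input here.
    """
    counts = Counter(hit_ids)
    return {doc_id for doc_id, n in counts.items() if n > 1}

# Toy data standing in for a real scan result:
ids_from_scroll = ["a1", "b2", "a1", "c3", "b2", "b2"]
print(sorted(find_duplicate_ids(ids_from_scroll)))  # ['a1', 'b2']
```

The painful part isn't this step, it's scanning ~7 million documents and then deleting all-but-one copy of each duplicated ID, which is why I'm hoping there's something built in.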