Remove redundancy of data

Amit2015 · November 23, 2015, 7:08am

Hi,

I have imported data into Elasticsearch without giving the documents unique IDs, and now I want to remove the redundant data. Please suggest how I can do that.

mainec · November 23, 2015, 2:15pm

The problem you describe sounds like a deduplication problem. Your favourite search engine will turn up several posts on approaches to that.

One question though: I guess simply reindexing and skipping the duplicates during the import is not an option for you?

nik9000 · November 23, 2015, 2:32pm

I think it's worth mentioning that deleted documents have to be cleaned out of the index at some point, so sometimes creating a new index and indexing only what you want is going to be faster. In this respect it really works much like a relational database.

Amit2015 · November 23, 2015, 5:32pm

Hi,

I am using Python to import a CSV file that has no unique ID, and the file is also big, so I can't reindex.

magnusbaeck · November 25, 2015, 6:57am

I am using Python to import a CSV file that has no unique ID, and the file is also big, so I can't reindex.

I don't see why not. If you still have the CSV file you should be able to rerun the Python script to reindex, right? Changing the script to keep track of duplicates by hashing each line (or select fields from each line) and storing those hashes in a set should be pretty easy.
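
To make the hashing idea concrete, here is a minimal sketch of it. It is not the original import script, which was never posted: the function names `row_hash` and `unique_rows`, the `key_fields` parameter, and the sample data are all made up for illustration, and the choice of SHA-1 and the separator character are assumptions.

```python
import csv
import hashlib
import io


def row_hash(row, key_fields=None):
    """Return a SHA-1 hex digest of a row, or of selected columns of it.

    key_fields is a hypothetical parameter: a list of column indexes that
    together identify a record. None means "use the whole row".
    """
    values = row if key_fields is None else [row[i] for i in key_fields]
    # Join with a separator that is unlikely to occur in the data, so that
    # ["ab", "c"] and ["a", "bc"] do not end up with the same digest.
    joined = "\x1f".join(values)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


def unique_rows(rows, key_fields=None):
    """Yield (digest, row) for the first occurrence of each distinct row."""
    seen = set()
    for row in rows:
        digest = row_hash(row, key_fields)
        if digest in seen:
            continue  # duplicate of an earlier row: skip it
        seen.add(digest)
        yield digest, row


# Small in-memory stand-in for the real CSV file.
sample = io.StringIO("alice,30\nbob,25\nalice,30\n")
deduplicated = list(unique_rows(csv.reader(sample)))
```

With the sample above, the third line is dropped because it repeats the first. Since `unique_rows` is a generator, the file is streamed row by row and only the set of digests stays in memory. One possible extension, again an assumption rather than something stated in the thread, is to use the digest as the document ID when indexing, so that importing the same record twice overwrites the existing document instead of creating a second copy.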
