Remove redundancy of data


(Amit) #1

Hi,

I have imported data into Elasticsearch without assigning unique IDs, and I now want to remove the redundant (duplicate) data. Please suggest how I can do that.


(Isabel Drost-Fromm) #2

The problem you describe sounds like a deduplication problem. Your favourite search engine will turn up several posts on how to approach that.

One question though: I guess simply re-indexing and skipping the duplicates during the import is not an option for you?


(Nik Everett) #3

I think it's worth mentioning that deletes have to be cleaned out of the index at some point, so sometimes creating a new index and indexing only what you want is going to be faster. In this respect it works much like a relational database.


(Amit) #4

Hi,

I am using Python to import a CSV file that has no unique ID, and the file is also big, so I can't re-index.


(Magnus Bäck) #5

I am using Python to import a CSV file that has no unique ID, and the file is also big, so I can't re-index.

I don't see why not. If you still have the CSV file you should be able to rerun the Python script to reindex, right? Changing the script to keep track of duplicates by hashing each line (or select fields from each line) and storing those hashes in a set should be pretty easy.
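A minimal sketch of that idea, assuming the import script reads the file with Python's csv module. The helper names (row_fingerprint, unique_rows), the key_fields parameter, and the sample data are made up for illustration, and the actual indexing call is left out because it depends on your existing script:

```python
import csv
import hashlib
import io


def row_fingerprint(row, key_fields=None):
    """Return a SHA-1 hex digest of the whole row, or of selected columns only."""
    values = row if key_fields is None else [row[i] for i in key_fields]
    # Join with a separator that is unlikely to appear in the data, so that
    # ["ab", "c"] and ["a", "bc"] produce different fingerprints.
    joined = "\x1f".join(values)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


def unique_rows(rows, key_fields=None):
    """Yield each row the first time its fingerprint is seen; skip repeats."""
    seen = set()
    for row in rows:
        fingerprint = row_fingerprint(row, key_fields)
        if fingerprint in seen:
            continue  # duplicate: do not index it again
        seen.add(fingerprint)
        yield row


# Stand-in for the real CSV file (hypothetical sample data).
sample = io.StringIO("name,city\nann,Oslo\nbob,Rome\nann,Oslo\nann,Paris\n")
reader = csv.reader(sample)
header = next(reader)  # skip the header line

# In the real script, each surviving row would be sent to Elasticsearch here.
deduped = list(unique_rows(reader))
```

Because the rows are streamed one at a time, only the set of digests stays in memory (one entry per distinct row), not the file itself. Passing key_fields, for example key_fields=[0], treats rows as duplicates when just those columns match.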

