I am using ES 2.2. We fetch data from services and ingest it into ES. The problem is that I have duplicate records in ES with the same data but different generated IDs.
We generate the id ourselves and index with the bulk API:
bulkRequest.add(client4Bulk.prepareIndex(indexName, indexType,co.getIndexKey()).setSource(json));
Every night we index data using this, and now we have multiple duplicate records. How should we update the existing records? How can we check whether a record already exists, so we know whether to use the update API or index it as a fresh record?
Is there any way to delete the duplicate records? How should we handle this? Please suggest.
I also tried to remove duplicate records using aggs like the following:
GET eb_portal_index/purchase_order/_search
{
  "query": {
    "match": {
      "po_no": "201642399"
    }
  },
  "aggs": {
    "dedup": {
      "terms": {
        "field": "pol"
      },
      "aggs": {
        "dedup_docs": {
          "top_hits": {
            "size": 1
          }
        }
      }
    }
  }
}
But this is also not working. Is there any way to remove all duplicate records from the search response based on some field?
My suggestion here is more about how the data is indexed, not about updating existing data.
Since you control how the data is indexed and you generate your own id, you can try the following:
If the entire record is exactly the same when a dup occurs, you can take an MD5 hash of the record and use it as the id, instead of using a UUID for example. When you index the data, pass this MD5 hash to ES as the id along with the document (or record) to be indexed. When ES sees the same id in the index, it removes the old document and inserts the new one (if I remember correctly).
Now, if your definition of a dup is slightly different, you need to decide which field, or combination of fields (kind of like a composite key in database terminology), in the document determines the uniqueness of a record; from that uniqueness you identify the dups. If this is the case, take the MD5 hash of the value of that field or combination of fields and use it as the id. A sketch of this follows below.
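For illustration, here is a minimal sketch of the second approach with the 2.x Java client. The composite key of po_no plus pol and the co.getPoNo()/co.getPol() getters are assumptions for this example; use whatever actually identifies a record in your data.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public final class DedupIds {

    // Build a deterministic id by hashing the fields that define uniqueness
    // (po_no + pol is only an illustrative "composite key").
    public static String idFor(String poNo, String pol) {
        try {
            String compositeKey = poNo + "|" + pol;
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(compositeKey.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }
}

// Re-indexing the same logical record now overwrites the existing document
// instead of creating a duplicate with a different id.
bulkRequest.add(client4Bulk.prepareIndex(indexName, indexType,
        DedupIds.idFor(co.getPoNo(), co.getPol())).setSource(json));

For the first approach (a dup means the whole record is identical), hash the full JSON source string instead of the composite key.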
For my data, I use the first approach to eliminate duplicated documents (or records) in the index. I have not used the second approach, but I have had many discussions with customers about their definition of a dup, and that was one of the popular cases.
This link also helps address the upsert and other cases.
Your requirements might be different, but in my opinion I would rather have a clean index with no duplicates.
Upsert is an interesting operation and it works well if you have a database backend. ES is a search engine, not a database (even though a lot of people think of ES as a database, and some even think of it as a storage layer and more), so when data is sent in, ES indexes it. I think indexing a "new document" is better than updating a field or a few fields within a document and then re-indexing the updated document (the upsert operation). Also, if you are doing bulk indexing, I think it will be faster than doing "bulk upserting" (which I believe ES also supports).
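If you do want to go the upsert route, here is a minimal sketch of what a bulk upsert could look like with the 2.x Java client, assuming docId is the same deterministic id described above:

// Upsert: merge the given fields into an existing document with this id,
// or index the document as new if the id is not found.
bulkRequest.add(client4Bulk.prepareUpdate(indexName, indexType, docId)
        .setDoc(json)
        .setDocAsUpsert(true));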
Anyway, it all comes down to your data and requirements before deciding which approach is suitable for you.