Hi,
We are using Elasticsearch 5.6 to store tracking events. Recently we ran a terms aggregation on one index to find duplicated events that have the same event type, device id, and event time, and then removed the duplicates from the index. The index contains about 300k events, and most of them are unique.
The following query is used to find the duplicates. We send this query in a loop and remove the duplicated events until nothing is found.
{
  "size": 0,
  "aggs": {
    "DuplicatedEvents": {
      "terms": {
        "script": "return doc['evt_time'].value+doc['device_id'].value+doc['type'].value;",
        "size": 1000,
        "min_doc_count": 2
      },
      "aggs": {
        "hits": {
          "top_hits": {
            "size": 1000
          }
        }
      }
    }
  }
}
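To make the intent of our loop clear, here is a minimal sketch of the dedup logic, simulated locally on plain dicts instead of a live cluster (the field names match the query above; the `_id` field and the sample events are hypothetical, and in the real cluster the deletes are issued through the delete APIs):

```python
# Simulate the dedup step locally: group events by the same compound key the
# terms-aggregation script builds (evt_time + device_id + type), keep the
# first document of each group, and collect the rest for deletion.

def find_duplicate_ids(events):
    """Return the _ids of all but the first event in each duplicate group."""
    seen = {}
    to_delete = []
    for event in events:
        key = (event["evt_time"], event["device_id"], event["type"])
        if key in seen:
            to_delete.append(event["_id"])  # duplicate: mark for deletion
        else:
            seen[key] = event["_id"]        # first occurrence: keep it
    return to_delete

# Hypothetical sample data: "a" and "b" share the same key, so "b" is a duplicate.
events = [
    {"_id": "a", "evt_time": 1, "device_id": "d1", "type": "click"},
    {"_id": "b", "evt_time": 1, "device_id": "d1", "type": "click"},
    {"_id": "c", "evt_time": 2, "device_id": "d1", "type": "click"},
]
print(find_duplicate_ids(events))  # ['b']
```

We repeat the real aggregation query and the deletes until the aggregation comes back empty.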
It runs smoothly, and most of the duplicates are removed. However, we noticed that a few duplicates are still there when we search events by a device id that previously had plenty of duplicated events. We ran the above query again to verify, and it definitely returns nothing. Then we tried increasing the size in the terms aggregation of the same query, and this time it does return these duplicates.
This confuses us a lot: why are these duplicated events not returned until we increase the size of the terms aggregation? We already send the query in a loop and remove duplicated events until nothing is found.
Is there any better solution to remove duplicated documents in Elasticsearch 5.6? (Upgrading to another version is not an option for us right now.)
If we have to increase the size to a very large number, will it crash the cluster?
Many thanks!