We are using Elasticsearch 5.6 to store track events. Recently we ran a terms aggregation on one index to find duplicated events that have the same event type, device ID, and event time, and then removed the duplicates from the index. The index contains about 300k events, most of which are unique.
We use the following query to find duplicates. We send this query in a loop and remove the duplicated events it returns until nothing is found.
"script": "return doc['evt_time'].value+doc['device_id'].value+doc['type'].value;",
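For context, the full request looks roughly like this. This is a sketch: only the script line above is from our actual query; the index name `events`, the bucket sizes, and the surrounding aggregation body are reconstructed. `min_doc_count: 2` keeps only buckets that contain duplicates, and the `top_hits` sub-aggregation returns the document IDs of the duplicates so they can be deleted.

```json
POST /events/_search
{
  "size": 0,
  "aggs": {
    "duplicates": {
      "terms": {
        "script": "return doc['evt_time'].value+doc['device_id'].value+doc['type'].value;",
        "min_doc_count": 2,
        "size": 100
      },
      "aggs": {
        "dup_docs": {
          "top_hits": {
            "size": 10,
            "_source": false
          }
        }
      }
    }
  }
}
```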
It runs smoothly and most of the duplicates are removed. However, we noticed that a few duplicates are still there when we search for events by a device ID that previously had plenty of duplicated events. We ran the above query again to verify, and it definitely returns nothing. But when we increase the size of the terms aggregation in the same query, it does return these duplicates.
This confuses us a lot: why are these duplicated events not returned until we increase the size of the terms aggregation? We already send the query in a loop and remove duplicated events until nothing is found.
Is there a better solution to remove duplicated documents in Elasticsearch 5.6? (Upgrading to another version is not an option for us right now.)
If we have to increase the size to a very large number, will it crash the cluster?
I think that you have a solution described in the documentation (note that it will require an upgrade, which you should do anyway):

If you want to retrieve all terms or all combinations of terms in a nested terms aggregation you should use the composite aggregation, which allows you to paginate over all possible terms rather than setting a size greater than the cardinality of the field in the terms aggregation. The terms aggregation is meant to return the top terms and does not allow pagination.
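For anyone reading this on Elasticsearch 6.1 or later, a composite aggregation over the three fields looks roughly like the sketch below. The index name `events` is an assumption, and note that composite does not support `min_doc_count`, so buckets with `doc_count >= 2` have to be filtered out on the client side.

```json
POST /events/_search
{
  "size": 0,
  "aggs": {
    "dups": {
      "composite": {
        "size": 1000,
        "sources": [
          { "evt_time":  { "terms": { "field": "evt_time" } } },
          { "device_id": { "terms": { "field": "device_id" } } },
          { "type":      { "terms": { "field": "type" } } }
        ]
      }
    }
  }
}
```

Each response contains an `after_key`; passing it back as `"after"` in the next request pages through all combinations instead of returning only the top ones.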
Thanks David for your help! Unfortunately we need to solve the problem before we can upgrade to the latest ES version.
How large is your index? How many shards does it have?
The index has around 300K documents with 3 shards.
If duplicates are spread across multiple shards so that no single shard holds two copies, they are difficult to find: each shard returns only its own top terms, and such terms can easily be missed. Given the small number of documents, the best way might be to change to a single primary shard.
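On 5.6, moving to a single primary shard can be done by reindexing into a new index created with one shard, along these lines (the index names here are placeholders):

```json
PUT /events_v2
{
  "settings": { "number_of_shards": 1 }
}

POST /_reindex
{
  "source": { "index": "events" },
  "dest":   { "index": "events_v2" }
}
```

With all documents on one shard, every copy of a duplicate key is counted by the same shard, so the terms aggregation no longer misses duplicates split across shards.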
If you do not want to merge down to a single shard, you may also be able to run multiple aggregations against subsets of your data, as that increases the chance that spread-out duplicates are detected.
What do you mean by multiple aggregations? Could you please share an example of how to detect duplicates that way? Many thanks!
Send multiple requests, where each aggregation only aggregates across a subset of the data in the index, e.g. a timestamp range.
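For example, wrapping the same terms aggregation in a range query restricts each request to one time slice; the field name matches the script above, while the index name and date values are illustrative:

```json
POST /events/_search
{
  "size": 0,
  "query": {
    "range": { "evt_time": { "gte": "2018-01-01", "lt": "2018-02-01" } }
  },
  "aggs": {
    "duplicates": {
      "terms": {
        "script": "return doc['evt_time'].value+doc['device_id'].value+doc['type'].value;",
        "min_doc_count": 2
      }
    }
  }
}
```

Because each slice contains far fewer distinct keys, a duplicated key is much more likely to land in each shard's top terms and be reported.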