Hi,
We are using Elasticsearch 5.6 to store tracking events. Recently we ran a terms aggregation on one index to find duplicate events that share the same event type, device ID, and event time, and then removed the duplicates from the index. The index contains about 300k events, and most of them are unique.
The following query is used to find the duplicates. We send this query in a loop and remove the duplicate events until nothing is found.
It runs smoothly and most of the duplicates are removed. However, we noticed that a few duplicates are still there when we search events by a device ID that previously had plenty of duplicate events. We ran the above query again to verify, and it definitely returns nothing. Then we increased the size in the terms aggregation of the same query, and this time it returned these duplicates.
This confuses us a lot. Why are these duplicate events not returned until we increase the size in the terms aggregation? We already send the query in a loop and remove duplicate events until nothing is found.
Is there a better way to remove duplicate documents in Elasticsearch 5.6? (Upgrading to another version is not an option for us right now.)
If we have to increase the size to a very large number, will it crash the cluster?
I think there is a solution described in the documentation (note that it requires an upgrade, which you should do anyway):
If you want to retrieve all terms or all combinations of terms in a nested terms aggregation you should use the Composite aggregation which allows to paginate over all possible terms rather than setting a size greater than the cardinality of the field in the terms aggregation. The terms aggregation is meant to return the top terms and does not allow pagination.
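As a rough illustration of the pagination described in that quote, here is a minimal Python sketch. The field names (event_type, device_id, event_time), the aggregation name "events", and the `search` callable are all assumptions for illustration; `search` stands for whatever function sends a request body to the cluster and returns the parsed response. The composite aggregation is not available in 5.6, so this only applies after an upgrade.

```python
def find_duplicates(search, page_size=1000):
    """Page through every (event_type, device_id, event_time) combination
    and collect the buckets that contain more than one document.

    `search` is a hypothetical callable: request body (dict) -> response (dict).
    """
    duplicates = []
    after = None
    while True:
        composite = {
            "size": page_size,
            "sources": [
                # Field names are assumptions; use the ones from your mapping.
                {"event_type": {"terms": {"field": "event_type"}}},
                {"device_id": {"terms": {"field": "device_id"}}},
                {"event_time": {"terms": {"field": "event_time"}}},
            ],
        }
        if after is not None:
            composite["after"] = after
        body = {"size": 0, "aggs": {"events": {"composite": composite}}}

        result = search(body)["aggregations"]["events"]
        buckets = result["buckets"]
        if not buckets:
            break
        # Keep only the combinations that occur more than once.
        duplicates.extend(b for b in buckets if b["doc_count"] > 1)
        # Resume after the last key seen; fall back to the last bucket's key
        # when the response carries no "after_key".
        after = result.get("after_key", buckets[-1]["key"])
    return duplicates
```

With the official Python client, `search` could be something like `lambda body: es.search(index="events", body=body)`, where the index name and the exact client call signature depend on your setup and client version.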
If you have duplicates spread across multiple shards, with fewer than 2 copies on any single shard, they are difficult to find: only the top terms are returned from each shard, so these can easily be missed. Given the small number of documents, the best approach might be to switch to a single primary shard.
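To make the effect concrete, here is a small pure-Python simulation (not Elasticsearch code). It assumes each shard returns only its local top terms and that the minimum document count is applied after the per-shard results are merged; the keys and the tie-breaking rule (by key, ascending) are made up for illustration.

```python
from collections import Counter

def shard_top_terms(docs, shard_size):
    """Return one shard's local top `shard_size` keys with their counts.
    Ties are broken by key here, which is an assumption for this sketch."""
    counts = Counter(docs)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:shard_size]

def merged_duplicates(shards, shard_size, min_doc_count=2):
    """Merge the per-shard top terms, then keep keys whose merged count
    reaches `min_doc_count`."""
    merged = Counter()
    for docs in shards:
        for key, count in shard_top_terms(docs, shard_size):
            merged[key] += count
    return {key: n for key, n in merged.items() if n >= min_doc_count}

# "z-dup" exists once on each shard, so it is a duplicate overall,
# but on each shard it looks just like any other unique key.
shard_a = ["a1", "a2", "z-dup"]
shard_b = ["b1", "b2", "z-dup"]

small = merged_duplicates([shard_a, shard_b], shard_size=2)
large = merged_duplicates([shard_a, shard_b], shard_size=3)
print(small)  # the duplicate never makes it into either shard's top list
print(large)  # with a larger size both shards report it
```

This matches what was observed: the query returns nothing until the size is large enough for every shard to include the key in its local result.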
If you do not want to merge into a single shard, you may also be able to run multiple aggregations against subsets of your data, as that may increase the chance that spread-out duplicates are detected.
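One way to split the data into subsets is the partition filter of the terms aggregation ("include" with "partition" and "num_partitions"), which I believe is available in 5.x but should be checked against the documentation for your version. The sketch below only builds the request bodies; the field names and aggregation names are assumptions, not taken from the original query.

```python
def partitioned_duplicate_queries(num_partitions, size=1000):
    """Yield one request body per partition of device_id values.

    Each request looks at a disjoint subset of device IDs, so fewer keys
    compete for a place in each shard's top terms.
    """
    for partition in range(num_partitions):
        yield {
            "size": 0,
            "aggs": {
                "by_device": {
                    "terms": {
                        "field": "device_id",  # assumed field name
                        "include": {
                            "partition": partition,
                            "num_partitions": num_partitions,
                        },
                        "size": size,
                    },
                    "aggs": {
                        "by_type": {
                            "terms": {"field": "event_type", "size": size},
                            "aggs": {
                                "by_time": {
                                    "terms": {
                                        "field": "event_time",
                                        "size": size,
                                        # Only report combinations seen twice or more.
                                        "min_doc_count": 2,
                                    }
                                }
                            },
                        }
                    },
                }
            },
        }

queries = list(partitioned_duplicate_queries(num_partitions=4))
```

For the results to be complete, `size` has to be at least the number of distinct values that fall into each partition, so more partitions allow a smaller size per request.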