Terms aggregation doesn't return all hits

Zining · November 13, 2020, 2:16pm

Hi,
We are using Elasticsearch 5.6 to store tracking events. Recently we ran a terms aggregation on one index to find duplicated events, meaning events that share the same event type, device id, and event time, and then removed the duplicates from the index. The index contains about 300k events and most of them are unique.

The following query is used to find duplicates. We send it in a loop, removing the duplicated events after each run, until it returns nothing.

{
	"size": 0,
	"aggs": {
		"DuplicatedEvents": {
			"terms": {
				"script": "return doc['evt_time'].value+doc['device_id'].value+doc['type'].value;",
				"size": 1000,
				"min_doc_count": 2
			},
			"aggs": {
				"hits": {
					"top_hits": {
						"size": 1000
					}
				}
			}
		}
	}
}

It runs smoothly and most of the duplicates are removed. However, we noticed that a few duplicates are still there when we search events by a device id that had plenty of duplicated events before. We ran the above query again to verify, and it definitely returns nothing. Then we increased the size of the terms aggregation in the same query, and this time it returned these duplicates.

This confuses us a lot. Why are these duplicated events not returned until we increase the size of the terms aggregation? We already send the query in a loop and remove duplicated events until nothing is found.

Is there a better way to remove duplicated documents in Elasticsearch 5.6? (Upgrading to another version is not an option for us right now.)

If we have to increase the size to a very large number, will it crash the cluster?

Many thanks!

dadoonet · November 13, 2020, 3:10pm

I think that you have a solution described in the documentation (note that it will require that you upgrade, which you should do anyway):

If you want to retrieve all terms or all combinations of terms in a nested terms aggregation you should use the Composite aggregation which allows to paginate over all possible terms rather than setting a size greater than the cardinality of the field in the terms aggregation. The terms aggregation is meant to return the top terms and does not allow pagination.
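For readers who can upgrade, the pagination described in the quote could look roughly like the sketch below. It assumes a version that has the composite aggregation (6.1 or later, so not the 5.6 cluster in this thread) and that `evt_time`, `device_id`, and `type` are fields with doc values, as the script in the question suggests. The helper names are hypothetical, and sending the request is left to whatever client is in use. Because the composite aggregation has no `min_doc_count` option, the sketch filters buckets on the client side.

```python
from typing import Any, Dict, List, Optional


def composite_request(after: Optional[Dict[str, Any]] = None,
                      page_size: int = 1000) -> Dict[str, Any]:
    """Build one page of a composite aggregation over the three key fields."""
    composite: Dict[str, Any] = {
        "size": page_size,
        "sources": [
            {"evt_time": {"terms": {"field": "evt_time"}}},
            {"device_id": {"terms": {"field": "device_id"}}},
            {"type": {"terms": {"field": "type"}}},
        ],
    }
    if after is not None:
        # Resume after the last key seen on the previous page.
        composite["after"] = after
    return {"size": 0, "aggs": {"DuplicatedEvents": {"composite": composite}}}


def duplicate_keys(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Keep only the bucket keys that occur at least twice."""
    buckets = response["aggregations"]["DuplicatedEvents"]["buckets"]
    return [b["key"] for b in buckets if b["doc_count"] >= 2]


def next_after(response: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return the key to resume from, or None when the page was empty."""
    agg = response["aggregations"]["DuplicatedEvents"]
    if not agg["buckets"]:
        return None
    # Prefer after_key when the server returns it, else use the last bucket key.
    return agg.get("after_key", agg["buckets"][-1]["key"])
```

A caller would start with `composite_request()`, collect `duplicate_keys(response)`, and repeat with `composite_request(after=next_after(response))` until `next_after` returns `None`.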

Zining · November 13, 2020, 3:24pm

Thanks David for your help! Unfortunately we need to solve the problem before we can upgrade to the latest ES version.

Christian_Dahlqvist · November 13, 2020, 4:47pm

How large is your index? How many shards does it have?

Zining · November 16, 2020, 8:41am

Hi Christian,
The index has around 300K documents with 3 shards.

Christian_Dahlqvist · November 16, 2020, 9:24am

If you have duplicates spread across multiple shards, with fewer than 2 copies on any single shard, they are difficult to find: only the top terms are returned from each shard, and on each shard these duplicates look like any other term with a count of 1, so they can easily be missed. Given the small number of documents, the best way might be to change to a single primary shard.
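The effect can be reproduced without a cluster. The following is a simplified simulation, not Elasticsearch's actual implementation: each "shard" reports only its top `shard_size` terms (ties broken by term, which is an assumption of this sketch), the coordinating step sums the reported counts, and `min_doc_count` is applied only after merging. All names are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def top_terms(shard_docs: List[str], shard_size: int) -> Dict[str, int]:
    """Return the shard's top terms by count, ties broken by term order."""
    counts = Counter(shard_docs)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:shard_size])


def merged_duplicates(shards: List[List[str]], shard_size: int,
                      min_doc_count: int = 2) -> List[str]:
    """Merge per-shard top terms, then keep terms seen min_doc_count times."""
    merged: Counter = Counter()
    for docs in shards:
        merged.update(top_terms(docs, shard_size))
    return sorted(term for term, n in merged.items() if n >= min_doc_count)


# "dup" exists twice in the index, but only once on each of two shards.
shards = [
    ["a", "b", "dup"],
    ["c", "d", "dup"],
    ["e", "f"],
]

print(merged_duplicates(shards, shard_size=2))  # duplicate is not reported
print(merged_duplicates(shards, shard_size=3))  # duplicate is reported
```

With a small per-shard limit, "dup" is cut off on both shards because it is indistinguishable from the unique terms there. Raising the limit, or putting everything on one shard, makes it visible, which matches the behaviour reported in the question.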

Zining · November 17, 2020, 7:49am

Thanks!

Christian_Dahlqvist · November 17, 2020, 7:51am

If you do not want to merge the shards, you may also be able to run multiple aggregations against subsets of your data, as that may increase the chance that spread-out duplicates are detected.

Zining · November 17, 2020, 7:58am

Hi Christian,
What do you mean by multiple aggregations? Could you please share an example of how to detect duplicates that way? Many thanks!

Christian_Dahlqvist · November 17, 2020, 8:04am

Send multiple requests where each aggregation only aggregates across a subset of the data in the index, e.g. a timestamp range.
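One way to read this suggestion is sketched below: the aggregation from the first post is kept unchanged and a `range` query restricts each request to one time window. The sketch assumes `evt_time` is a date field that accepts epoch milliseconds in range queries. The function name and the window parameters are hypothetical, and each body still has to be sent to the `_search` endpoint with the client of your choice.

```python
from typing import Any, Dict, List

# Script and sizes copied from the query in the first post.
DUPLICATE_AGG: Dict[str, Any] = {
    "DuplicatedEvents": {
        "terms": {
            "script": "return doc['evt_time'].value+doc['device_id'].value+doc['type'].value;",
            "size": 1000,
            "min_doc_count": 2,
        },
        "aggs": {"hits": {"top_hits": {"size": 1000}}},
    }
}


def windowed_requests(start_ms: int, end_ms: int,
                      step_ms: int) -> List[Dict[str, Any]]:
    """Build one search body per half-open time window [lo, hi)."""
    if step_ms <= 0:
        raise ValueError("step_ms must be positive")
    bodies: List[Dict[str, Any]] = []
    lo = start_ms
    while lo < end_ms:
        hi = min(lo + step_ms, end_ms)
        bodies.append({
            "size": 0,
            # Only events inside this window take part in the aggregation.
            "query": {"range": {"evt_time": {"gte": lo, "lt": hi}}},
            "aggs": DUPLICATE_AGG,
        })
        lo = hi
    return bodies
```

Since duplicates share the same event time by definition, all copies of an event fall into the same window, so slicing by time does not split a group of duplicates. Smaller windows mean fewer distinct terms per request, which makes it less likely that a duplicate is cut off by the per-shard top-terms limit.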

