SignificantTerms Agg : _superset_size greater than doc_count !?

tomlameche · September 17, 2015, 2:54pm

Hello,
My apologies for my bad English...

I'm trying to script a custom score for a significant_terms aggregation, and I'm very surprised by the values of the variables _superset_freq and _superset_size.

=> _superset_size is greater than doc_count: very strange, no?
=> _superset_freq for each bucket is greater than a simple count from a terms aggregation for the same term

I don't understand what is happening...

Does anybody have an explanation? Perhaps I am doing something wrong...

Example:

Thanks,
Tom

Mark_Harwood · September 17, 2015, 4:15pm

"subset" relates to the docs that match your query/parent bucket in the agg tree.
"superset" relates to the index from which these are drawn (or your choice of background_filter).

The stats in significant_terms calculations essentially perform a diff between popularity of terms in the subset and their popularity in the superset.
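To make that "diff" concrete, here is a minimal sketch in Python. It is only an illustration of comparing a term's popularity in the subset against its popularity in the superset, loosely in the style of the JLH heuristic (absolute change times relative change); the function name and the sample numbers are made up for this example and are not taken from the thread.

```python
def popularity_diff(subset_freq, subset_size, superset_freq, superset_size):
    """Score how much more popular a term is in the subset than in the superset.

    Illustrative only: argument names mirror the script variables
    (_subset_freq, _subset_size, _superset_freq, _superset_size).
    """
    if subset_size == 0 or superset_size == 0 or superset_freq == 0:
        return 0.0  # nothing to compare against

    subset_rate = subset_freq / subset_size        # popularity in the foreground
    superset_rate = superset_freq / superset_size  # popularity in the background

    if subset_rate <= superset_rate:
        return 0.0  # the term is not over-represented in the subset

    absolute_change = subset_rate - superset_rate
    relative_change = subset_rate / superset_rate
    return absolute_change * relative_change
```

A term seen in 50 of 100 subset docs but only 100 of 10,000 superset docs scores highly, while a term with the same rate in both scores 0. If the superset counts are inflated, the background rate changes and so does every score.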

Does this help clarify?

tomlameche · September 17, 2015, 5:28pm

Yes, the definition of subset and superset is clear.
My problem is that the value of _superset_size in a script is greater than the total number of documents in my index. I think there is a bug somewhere.

I run a significant_terms aggregation nested under a simple terms aggregation, similar to the example described here: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-significantterms-aggregation.html

I need to compute a custom score, so I tried to use "script_heuristic". The results seemed strange to me, so I modified my script to view the value of each variable (_superset_size, _superset_freq, _subset_freq and _subset_size) with:
    "script_heuristic": {
        "script": "_superset_size"
    }
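For anyone who wants to repeat this check, here is a rough sketch in Python that builds one request body per variable, each returning that variable as the bucket score. The "script_heuristic" syntax is copied from the snippet above; newer Elasticsearch versions may expect a different script format. The field names "category" and "tags" and the aggregation names are hypothetical placeholders, not the fields from my index.

```python
import json

# The four variables available to a script heuristic, as named in this thread.
HEURISTIC_VARIABLES = ["_subset_freq", "_subset_size", "_superset_freq", "_superset_size"]


def debug_request(variable, parent_field="category", terms_field="tags"):
    """Build a search body whose significance score is just the given variable.

    parent_field and terms_field are hypothetical field names.
    """
    if variable not in HEURISTIC_VARIABLES:
        raise ValueError("unknown heuristic variable: " + variable)
    return {
        "size": 0,  # only the aggregation results are of interest
        "aggs": {
            "by_parent": {
                "terms": {"field": parent_field},
                "aggs": {
                    "significant": {
                        "significant_terms": {
                            "field": terms_field,
                            # Same form as the snippet above: score = raw variable
                            "script_heuristic": {"script": variable},
                        }
                    }
                },
            }
        },
    }


def all_debug_requests():
    """Return one JSON request body per heuristic variable."""
    return {name: json.dumps(debug_request(name), indent=2) for name in HEURISTIC_VARIABLES}
```

Each body can be sent to the _search endpoint in turn; the "score" of each bucket then shows the raw value of the chosen variable.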

And what a surprise: the value of _superset_size is greater than the total number of documents in my index...

In addition, the value of bg_count is greater than the total count for each term, as described in the topic "Bg_counts in nested significant_terms aggregation".

There is a bug, I guess.

Mark_Harwood · September 17, 2015, 5:38pm

Are you using any of the following:

nested docs?
aliases with filters?
indexes with deleted docs?
multiple indices/shards where some of them do not contain fields or content related to the query?

They all have the potential to skew the stats. The fact that these stats are approximate is acknowledged here: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-significantterms-aggregation.html#_approximate_counts
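As a rough illustration of two items on that list, here is a toy model in Python. It is not Lucene's actual data structure: it only assumes that each nested object is stored as its own hidden document and that deleted documents remain in the raw index counts until segments are merged. All class and function names are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class LuceneDoc:
    """One low-level document in the toy index (hypothetical model)."""
    is_nested_child: bool = False
    is_deleted: bool = False


def build_toy_index(live_parents, children_per_parent, deleted_parents):
    """Create parents with nested children; some parents (and their children) are deleted."""
    docs = []
    for i in range(live_parents + deleted_parents):
        deleted = i >= live_parents
        # Nested children are stored as separate hidden documents.
        docs.extend(
            LuceneDoc(is_nested_child=True, is_deleted=deleted)
            for _ in range(children_per_parent)
        )
        docs.append(LuceneDoc(is_deleted=deleted))
    return docs


def search_visible_count(docs):
    """What a normal search counts: live, top-level documents only."""
    return sum(1 for d in docs if not d.is_nested_child and not d.is_deleted)


def raw_index_count(docs):
    """What a pre-computed index-level statistic counts in this toy model: everything."""
    return len(docs)
```

With 100 live parents, 3 nested children each, and 20 deleted parents, a search sees 100 documents while the raw count is 480, which is the kind of gap that makes a superset size look larger than the index.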

tomlameche · September 18, 2015, 7:42am

Thanks for your help. Yes, I'm using nested docs, and perhaps indexes with deleted docs.

So it's not a bug, it's a feature.

Mark_Harwood · September 18, 2015, 8:28am

We use the fastest source of stats, the pre-computed counts held in the Lucene index, and these are susceptible to the accuracy issues I outlined. I'd probably add to the list of gotchas the situation where you have multiple document types in the same index.

However, it is possible to define an alternative source of background stats which relies on re-counting values on the fly for all docs that match a given filter [1]. This may provide a way to fix the accuracy issues you are experiencing.

[1] https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-significantterms-aggregation.html#_custom_background_context
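A minimal sketch of what such a request could look like, built as a Python dict. The "background_filter" parameter is the one mentioned earlier in this thread; the field names ("category", "tags", "doc_type"), the filter value, and the aggregation names are assumptions chosen for illustration, so substitute whatever identifies the documents you consider the true background.

```python
import json


def significant_terms_with_background(parent_field, terms_field, filter_field, filter_value):
    """Build a search body that re-counts background stats from a filter.

    All four arguments are hypothetical names supplied by the caller.
    """
    return {
        "size": 0,  # aggregation results only
        "aggs": {
            "by_parent": {
                "terms": {"field": parent_field},
                "aggs": {
                    "significant": {
                        "significant_terms": {
                            "field": terms_field,
                            # Background is counted on the fly from docs matching this filter
                            # instead of being read from pre-computed index statistics.
                            "background_filter": {
                                "term": {filter_field: filter_value}
                            },
                        }
                    }
                },
            }
        },
    }


def as_json(body):
    """Serialize the request body for sending to the _search endpoint."""
    return json.dumps(body, indent=2)
```

For example, significant_terms_with_background("category", "tags", "doc_type", "article") restricts the background to one document type. Since the background is re-counted per request, expect this to be slower than the default.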
