SignificantTerms Agg : _superset_size greater than doc_count !?


(Tomlameche) #1

Hello,
My apologies for my bad English...

I'm trying to script a custom score for the significant terms aggregation, and I'm very surprised by the values of the variables _superset_freq and _superset_size.

=> _superset_size is greater than doc_count: very strange, no?
=> _superset_freq for each bucket is greater than a simple count from a terms aggregation on the same term

I don't understand what is happening...

Does anybody have an explanation? Perhaps I'm doing something wrong...

Example:

Thanks,
Tom


(Mark Harwood) #2

"subset" relates to the docs that match your query/parent bucket in the agg tree.
"superset" relates to the index from which these are drawn (or your choice of background_filter).

The stats in significant_terms calculations essentially perform a diff between popularity of terms in the subset and their popularity in the superset.
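
As a rough illustration of how the four script variables fit together, here is a minimal Python sketch that scores a term by multiplying the absolute change in popularity by the relative change, which is how the JLH heuristic is described in the Elasticsearch documentation. The function name and the sample numbers are invented for this example, and it is not the actual Elasticsearch code.

```python
def popularity_diff_score(subset_freq, subset_size, superset_freq, superset_size):
    """Score a term by how much more popular it is in the subset than in the superset."""
    if subset_size == 0 or superset_size == 0 or superset_freq == 0:
        return 0.0
    subset_rate = subset_freq / subset_size        # popularity in the matching docs
    superset_rate = superset_freq / superset_size  # popularity in the background
    absolute_change = subset_rate - superset_rate
    if absolute_change <= 0:
        return 0.0                                 # not more popular than the background
    relative_change = subset_rate / superset_rate
    return absolute_change * relative_change


# Invented numbers: the term is in 50 of 100 matching docs, 1000 of 100000 overall.
score = popularity_diff_score(50, 100, 1000, 100000)
print(score)
```

With any score of this shape, an inflated _superset_size or _superset_freq changes the background rate and therefore the result, which is why the accuracy of those two values matters.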

Does this help clarify?


(Tomlameche) #3

Yes, the definition of subset and superset is clear.
My problem is that the value of _superset_size in a script is greater than the total number of documents in my index. I think there is a bug somewhere.

I run a significant terms aggregation nested under a simple terms aggregation, similar to the example described here: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-significantterms-aggregation.html

I need to compute a custom score, so I tried "script_heuristic". The results looked strange to me, so I changed my script to output the value of each variable in turn (_superset_size, _superset_freq, _subset_freq and _subset_size), for example:
"script_heuristic": {
"script": "_superset_size"
}

And what a surprise: the value of _superset_size is greater than the total number of documents in my index...

In addition, the value of bg_count is greater than the total count for each term, as described in the thread "Bg_counts in nested significant_terms aggregation".

There is a bug, I guess.


(Mark Harwood) #4

Are you using any of the following:

  • nested docs?
  • aliases with filters?
  • indexes with deleted docs?
  • multiple indices/shards where some of them do not contain fields or content related to the query?

They all have the potential to skew the stats. The fact that these stats are approximate is acknowledged here: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-significantterms-aggregation.html#_approximate_counts
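
A toy illustration of how nested docs and deleted docs can push the background numbers above what a query reports. It assumes that each nested object is stored as its own hidden Lucene document and that deleted documents keep counting in the pre-computed statistics until segments are merged. The data and names are invented; this is plain Python, not Elasticsearch code.

```python
# Each entry stands for one Lucene-level document (invented data).
lucene_docs = [
    {"id": 1, "kind": "root", "deleted": False, "terms": {"bike"}},
    {"id": 1, "kind": "nested", "deleted": False, "terms": set()},
    {"id": 1, "kind": "nested", "deleted": False, "terms": set()},
    {"id": 2, "kind": "root", "deleted": False, "terms": {"theft"}},
    {"id": 3, "kind": "root", "deleted": True, "terms": {"bike"}},  # not yet merged away
]


def visible_count(docs, term=None):
    """What a normal search counts: live, top-level documents only."""
    return sum(
        1 for d in docs
        if d["kind"] == "root" and not d["deleted"]
        and (term is None or term in d["terms"])
    )


def raw_index_count(docs, term=None):
    """What raw index statistics see in this toy model: every Lucene document."""
    return sum(1 for d in docs if term is None or term in d["terms"])


print("docs seen by a query:      ", visible_count(lucene_docs))
print("docs in raw index stats:   ", raw_index_count(lucene_docs))
print("'bike' seen by a query:    ", visible_count(lucene_docs, "bike"))
print("'bike' in raw index stats: ", raw_index_count(lucene_docs, "bike"))
```

In this toy model both the raw total and the raw per-term count come out larger than the visible ones, which matches the symptoms described above.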


(Tomlameche) #5

Thanks for your help. Yes, I'm using nested docs, and perhaps indexes with deleted docs.

So, it's not a bug, it's a feature :wink:


(Mark Harwood) #6

We use the fastest source of stats, which is the pre-computed counts held in the Lucene index; these are susceptible to the accuracy issues I outlined. I'd probably add one more gotcha to the list: having multiple document types in the same index.

However, it is possible to define an alternative source of background stats which relies on re-counting values on the fly for all docs that match a given filter [1]. This may provide a way to fix the accuracy issues you are experiencing.

[1] https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-significantterms-aggregation.html#_custom_background_context
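
A minimal sketch of what such a request body might look like, built as a Python dict so it can be printed as JSON. The field names ("category", "tags", "doc_type"), the aggregation names and the filter value are hypothetical; the script_heuristic form is copied from the post above and may differ between Elasticsearch versions. Since the background counts are recomputed on the fly, expect this to be slower than the default.

```python
import json

# Hypothetical field names and filter, for illustration only.
request_body = {
    "size": 0,
    "aggs": {
        "by_category": {
            "terms": {"field": "category"},
            "aggs": {
                "significant_tags": {
                    "significant_terms": {
                        "field": "tags",
                        # Background stats are re-counted over docs matching this filter
                        # instead of being read from the pre-computed index counts.
                        "background_filter": {"term": {"doc_type": "article"}},
                        # Debug heuristic from the thread: return the raw variable as the score.
                        "script_heuristic": {"script": "_superset_size"},
                    }
                }
            },
        }
    },
}

print(json.dumps(request_body, indent=2))
```

With a filter like this in place, _superset_size should reflect the number of documents matching the filter, which makes it easier to compare against doc_count.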

