I'm trying to script a custom score for the significant_terms aggregation, and I'm very surprised by the values of the variables _superset_freq and _superset_size.
=> _superset_size is greater than doc_count: very strange, no?
=> _superset_freq for each bucket is greater than a simple count from a terms aggregation on the same term
I don't understand what is happening...
Does anybody have an explanation? Perhaps there is something I'm doing wrong...
"subset" relates to the docs that match your query/parent bucket in the agg tree.
"superset" relates to the index from which these are drawn (or your choice of background_filter).
The stats in significant_terms calculations essentially perform a diff between popularity of terms in the subset and their popularity in the superset.
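As a sketch (the index and field names below are made up for illustration): in a request like this, the subset is the set of docs matching the `match` query, while the superset defaults to the whole index:

```json
POST /reports/_search
{
  "query": { "match": { "body": "bicycle" } },
  "aggs": {
    "suspicious_tags": {
      "significant_terms": { "field": "tags" }
    }
  }
}
```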
Yes, the definition of subset and superset is clear.
My problem is that the value of _superset_size in a script is greater than the total number of documents in my index. I think there is a bug somewhere.
I need to compute a custom score, so I tried to use "script_heuristic". The result seemed strange to me, so I modified my script to inspect the value of each variable (_superset_size, _superset_freq, _subset_freq and _subset_size) with:
"script_heuristic": {
"script": "_superset_size"
}
And what a surprise: the value of _superset_size is greater than the total number of documents in my index...
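For reference, a debugging script like that sits inside the aggregation roughly as follows (field names are hypothetical); each bucket's score then reports the raw _superset_size value directly:

```json
{
  "aggs": {
    "suspicious_tags": {
      "significant_terms": {
        "field": "tags",
        "script_heuristic": {
          "script": "_superset_size"
        }
      }
    }
  }
}
```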
We use the fastest source of stats, which is the pre-computed counts held in the Lucene index; these are susceptible to the accuracy issues I outlined. I'd probably add to that list of gotchas the situation where you have multiple document types in the same index.
However, it is possible to define an alternative source of background stats which relies on re-counting values on the fly for all docs that match a given filter [1]. This may provide a way to fix the accuracy issues you are experiencing.
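A background_filter is defined on the significant_terms aggregation itself; assuming a hypothetical `type` field used to separate document types, it could look something like this, forcing the superset stats to be re-counted only over docs matching the filter:

```json
"significant_terms": {
  "field": "tags",
  "background_filter": {
    "term": { "type": "report" }
  }
}
```

Note that on-the-fly counting is slower than using the pre-computed index stats, so this trades speed for accuracy.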