Bg_counts in nested significant_terms aggregation

elasticpete · January 28, 2015, 2:07pm

When using a significant_terms aggregation nested inside another aggregation, e.g. terms, I get different bg_counts for the same significant term found across term buckets.
Say the outer terms agg is on a field with US state codes ("CA", "FL", "NY", etc.) and the nested significant_terms agg is on a field with the type of sport each person plays ("tennis", "golf", "skiing", etc.).
I see the following types of results:

"aggregations": {
  "frequentTerms": {
    "buckets": [
      {
        "key": "NY",
        "doc_count": 2027,
        "significantTerms": {
          "doc_count": 2027,
          "buckets": [
            {
              "key": "sailing",
              "doc_count": 80,
              "score": 0.029240945633836113,
              "bg_count": 80
            },
            {
              "key": "golf",
              "doc_count": 77,
              "score": 0.02907984745352633,
              "bg_count": 77
            }
          ]
        }
      },
      {
        "key": "CA",
        "doc_count": 100,
        "significantTerms": {
          "doc_count": 100,
          "buckets": [
            {
              "key": "golf",
              "doc_count": 42,
              "score": 0.02301730117174594,
              "bg_count": 18
            },
            {
              "key": "tennis",
              "doc_count": 42,
              "score": 0.012398130001513895,
              "bg_count": 9
            }
          ]
        }
      }
    ]
  }
}

I would expect that the bg_count for "golf" would be identical for the two buckets (states). I have set the shard_size to a very high number, and both min_doc_count and shard_min_doc_count to 1, with no effect.
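
For reference, the request described above could look roughly like the following. This is a minimal sketch, not the exact query that was run: the field names "state" and "sport", the helper name build_request, the "size": 0 setting and the concrete shard_size value are all assumptions made for illustration. Only the aggregation names and the three significant_terms settings come from the post.

```python
import json


def build_request(outer_field="state", inner_field="sport", shard_size=100000):
    """Build a search body with significant_terms nested under terms.

    The field names and the shard_size value are hypothetical placeholders.
    """
    return {
        "size": 0,  # assumption: only the aggregations are of interest
        "aggs": {
            "frequentTerms": {
                "terms": {"field": outer_field},
                "aggs": {
                    "significantTerms": {
                        "significant_terms": {
                            "field": inner_field,
                            "shard_size": shard_size,   # "a very high number"
                            "min_doc_count": 1,
                            "shard_min_doc_count": 1,
                        }
                    }
                },
            }
        },
    }


if __name__ == "__main__":
    # Print the body so it can be pasted into a search request.
    print(json.dumps(build_request(), indent=2))
```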

Any insights would be much appreciated.

Thanks, Petter.

tomlameche · September 17, 2015, 3:21pm

I see the same problem, and, stranger still, the _superset_freq is greater than doc_count...

Mark_Harwood · September 17, 2015, 4:39pm

I expect the problem to be related to nested docs.
They physically exist as separate docs in Lucene (from where we get some of our term frequency stats) but are accounted for differently in Elasticsearch in things like top-level aggs, where we like to pretend they don't exist.
The Lucene APIs we rely on for fast access to frequencies are subject to inaccuracies due to things like deleted documents, but it looks like nested docs are another source of potential inaccuracies.
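
The symptoms reported in this thread can be spotted mechanically. The following is a minimal Python sketch, assuming a response shaped like the one in the first post; the function name find_bg_count_anomalies is hypothetical, and the second check assumes the default background (the whole index), in which case a term's bg_count should not be smaller than its doc_count. It only detects the inconsistency; it does not correct the underlying frequency stats.

```python
from collections import defaultdict


def find_bg_count_anomalies(response, outer="frequentTerms", inner="significantTerms"):
    """Return (inconsistent, below_fg) for a nested significant_terms response.

    inconsistent: {term: {outer_key: bg_count}} for terms whose bg_count
                  differs between outer buckets.
    below_fg:     [(outer_key, term)] where bg_count < doc_count.
    """
    bg_by_term = defaultdict(dict)
    below_fg = []
    for bucket in response["aggregations"][outer]["buckets"]:
        for sig in bucket[inner]["buckets"]:
            bg_by_term[sig["key"]][bucket["key"]] = sig["bg_count"]
            if sig["bg_count"] < sig["doc_count"]:
                below_fg.append((bucket["key"], sig["key"]))
    inconsistent = {
        term: counts
        for term, counts in bg_by_term.items()
        if len(set(counts.values())) > 1
    }
    return inconsistent, below_fg


# Counts copied from the response in the first post (scores omitted).
sample = {"aggregations": {"frequentTerms": {"buckets": [
    {"key": "NY", "doc_count": 2027, "significantTerms": {"buckets": [
        {"key": "sailing", "doc_count": 80, "bg_count": 80},
        {"key": "golf", "doc_count": 77, "bg_count": 77}]}},
    {"key": "CA", "doc_count": 100, "significantTerms": {"buckets": [
        {"key": "golf", "doc_count": 42, "bg_count": 18},
        {"key": "tennis", "doc_count": 42, "bg_count": 9}]}},
]}}}

if __name__ == "__main__":
    inconsistent, below_fg = find_bg_count_anomalies(sample)
    print(inconsistent)  # {'golf': {'NY': 77, 'CA': 18}}
    print(below_fg)      # [('CA', 'golf'), ('CA', 'tennis')]
```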
