Background Count (bg_count) Remains Zero in Nested and Filtered significant_terms Aggregation

Hi everyone,

I've recently started using the significant_terms aggregation with a nested field in my index, and I've noticed that the results are very similar to those of a standard terms aggregation. This leads me to believe that the background calculations for significance might not be working as expected with nested fields. The bg_count is 0 for every bucket, as shown in the results below.


    "aggregations": {
      "significant_terms_nested": {
        "doc_count": 3823,
        "pos_filter": {
          "doc_count": 1522,
          "significant_terms": {
            "doc_count": 1522,
            "bg_count": 1445178,
            "buckets": [
              {
                "key": "chatgpt",
                "doc_count": 222,
                "score": 30746.516992131186,
                "bg_count": 0
              },
              {
                "key": "ai",
                "doc_count": 93,
                "score": 5395.764864337504,
                "bg_count": 0
              },
              {
                "key": "chatbot",
                "doc_count": 23,
                "score": 330.01054874542626,
                "bg_count": 0
              },
              {
                "key": "openai",
                "doc_count": 21,
                "score": 275.1115639046071,
                "bg_count": 0
              },
              {
                "key": "google",
                "doc_count": 19,
                "score": 225.20351532753946,
                "bg_count": 0
              },
              {
                "key": "rival",
                "doc_count": 15,
                "score": 140.3602269646585,
                "bg_count": 0
              }, ...

Here's a simplified version of my index mapping:

{
  "properties": {
    "my_field": {
      "type": "nested",
      "properties": {
        "txt": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword"
            }
          }
        },
        "pos": {
          "type": "keyword"
        }
      }
    }
  }
}

To provide more clarity, I'm using the significant_terms aggregation as follows, where I'm filtering based on the pos field before performing the aggregation:

{
  "significant_terms_nested": {
    "nested": {
      "path": "my_field"
    },
    "aggs": {
      "pos_filter": {
        "filter": {
          "terms": {
            "my_field.pos": ["noun", "verb", "adj"]
          }
        },
        "aggs": {
          "significant_terms": {
            "significant_terms": {
              "field": "my_field.txt.keyword",
              "size": 50
            }
          }
        }
      }
    }
  }
}
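For comparison, the per-bucket doc_count values from a plain terms aggregation on the same field look perfectly normal, which is part of why I suspect only the background side is broken. For reference, that sanity check looks like this (terms_check and all_terms are just placeholder names):

{
  "terms_check": {
    "nested": {
      "path": "my_field"
    },
    "aggs": {
      "all_terms": {
        "terms": {
          "field": "my_field.txt.keyword",
          "size": 50
        }
      }
    }
  }
}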

My primary questions are:

  1. When using significant_terms on a nested field, especially after filtering by a nested field's value (like pos in my case), do I need to tell Elasticsearch which field or document set to use for the background? I'd expect the background to be the entire index with no filters applied. How do I ensure this?
  2. Is it mandatory for a field to be mapped as text for the significant_terms aggregation to work properly? Or is it sufficient if a field is only mapped as a keyword?

Initially, I mapped the .txt field as keyword only. The terms returned by the significant_terms aggregation were not as "significant" as I had expected, and I began to wonder whether this was because the field wasn't also mapped as text. Hoping for more relevant results, I added the text mapping, but to my disappointment this change made no notable difference in the aggregation results.

Applying a background_filter also has no effect; bg_count is still 0:

"background_filter": {
  "match_all": {}
}

Steps to reproduce:

[
  {
    "txt": "word1",
    "pos": "POS_TYPE"
  },
  {
    "txt": "word2",
    "pos": "POS_TYPE"
  },
  ...
  {
    "txt": "wordN",
    "pos": "POS_TYPE"
  }
]
  1. Index documents with the my_field field structured as shown above.
  2. Apply the significant_terms aggregation using nested and filtered queries on the my_field field.
  3. Observe the bg_count in the aggregation results.

Any insights or guidance on this would be greatly appreciated. Thanks in advance!

I expect significant_terms and significant_text aggregations to have a hard time dealing with nested docs.
There are 4 numbers used to calculate significance:

  1. Foreground: number of docs
  2. Foreground: number of docs containing the term
  3. Background: number of docs
  4. Background: number of docs containing the term

Without nested this all works well, but with nested fields things get complicated, because 1) and 3) typically count non-nested (root) docs while 2) and 4) include nested docs.
This means you can get nonsensical stats being reported, e.g. more docs containing the term Foo than there are docs in the index. I'm unclear on the exact behaviour but anticipate issues.
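As a concrete illustration of how those four numbers feed into a score, here is a sketch of the default JLH heuristic, whose documented formula is (fg% − bg%) × (fg% / bg%). The bg_count values of 50 and 5000 below are purely hypothetical; the point is just that a background count near zero blows the score up:

```python
def jlh_score(fg_count, fg_total, bg_count, bg_total):
    """Sketch of the default JLH significance heuristic:
    (fg% - bg%) * (fg% / bg%), per the Elasticsearch docs."""
    fg_pct = fg_count / fg_total
    bg_pct = bg_count / bg_total
    if bg_pct == 0:
        # A term that "never" occurs in the background has an
        # unbounded relative change, hence the huge scores.
        return float("inf")
    return (fg_pct - bg_pct) * (fg_pct / bg_pct)

# Foreground numbers from the "chatgpt" bucket above: 222 of 1522 docs,
# against a background of 1,445,178 docs. The bg_count values are made up.
broken = jlh_score(222, 1522, 0, 1445178)       # reported bg_count of 0 → inf
rare = jlh_score(222, 1522, 50, 1445178)        # hypothetical rare term
common = jlh_score(222, 1522, 5000, 1445178)    # hypothetical common term
print(broken, rare, common)
```

With a realistic nonzero bg_count the scores drop to sane magnitudes, which is consistent with the suspicion that only the background lookup is broken here.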

From the looks of your mapping it seems you may be doing something very expensive more generally, e.g. using a nested doc for every word in some text? How about a flat document structure instead, with separate fields for nouns, adjectives, etc. if you want to run significance analysis on them separately. Also consider the annotated_text field type if you want to do interesting things with text that has been marked up with parts of speech.
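A flat mapping along those lines might look something like this (the field names nouns/verbs/adjectives are just illustrative):

{
  "properties": {
    "nouns": { "type": "keyword" },
    "verbs": { "type": "keyword" },
    "adjectives": { "type": "keyword" }
  }
}

A significant_terms aggregation could then target e.g. "field": "nouns" directly, with no nested or filter wrapper, so foreground and background would both be counted at the root-document level.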

I've contemplated transforming the nested structures into two separate fields, resulting in parallel arrays of identical length:

"words": ["word1", "word2", "word3", ...],
"postypes": ["noun", "verb", "adjective", ...]

I can guarantee that these arrays have matching lengths, at least per document. However, I'm uncertain how to ensure, within Elasticsearch, that each word is only counted if its corresponding postype matches a specified set, e.g. ["noun", "verb"].

My feeling is that this would require an aggregation script, which is presumably no better performance-wise than the nested structure; I'd prefer a solution that works with native aggregations.
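One possible compromise would be to do the split at index time rather than at aggregation time, e.g. with an ingest pipeline script processor that derives a flat per-POS field from the parallel arrays. This is only a sketch, assuming fields named words, postypes, and nouns as in the examples above:

{
  "description": "Derive a flat nouns field from parallel words/postypes arrays (sketch)",
  "processors": [
    {
      "script": {
        "source": "ctx.nouns = []; for (int i = 0; i < ctx.words.size(); i++) { if (ctx.postypes[i] == 'noun') { ctx.nouns.add(ctx.words[i]) } }"
      }
    }
  ]
}

The aggregation would then run natively on the flat nouns field, with the scripting cost paid once per document at ingest instead of on every query.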

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.