Hi everyone,
I've recently started using the significant_terms aggregation with a nested field in my index, and I've noticed that the results are very similar to those of a standard terms aggregation. This leads me to believe that the background calculations for significance might not be working as expected with nested fields. Every bucket reports a bg_count of 0, as shown in this bucket list:
"aggregations": {
  "significant_terms_nested": {
    "doc_count": 3823,
    "pos_filter": {
      "doc_count": 1522,
      "significant_terms": {
        "doc_count": 1522,
        "bg_count": 1445178,
        "buckets": [
          {
            "key": "chatgpt",
            "doc_count": 222,
            "score": 30746.516992131186,
            "bg_count": 0
          },
          {
            "key": "ai",
            "doc_count": 93,
            "score": 5395.764864337504,
            "bg_count": 0
          },
          {
            "key": "chatbot",
            "doc_count": 23,
            "score": 330.01054874542626,
            "bg_count": 0
          },
          {
            "key": "openai",
            "doc_count": 21,
            "score": 275.1115639046071,
            "bg_count": 0
          },
          {
            "key": "google",
            "doc_count": 19,
            "score": 225.20351532753946,
            "bg_count": 0
          },
          {
            "key": "rival",
            "doc_count": 15,
            "score": 140.3602269646585,
            "bg_count": 0
          }, ...
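As a sanity check on these numbers, the reported scores can be reproduced with a short sketch. It assumes the default JLH heuristic is in use and that a background frequency of 0 is treated as 1 to keep the ratio finite; both are assumptions about the scoring, not confirmed behavior. The function name jlh_score is hypothetical.

```python
def jlh_score(subset_freq, subset_size, superset_freq, superset_size):
    """JLH-style score: absolute change times relative change in term frequency."""
    if subset_size == 0 or superset_size == 0:
        return 0.0
    # Assumption: a term missing from the background is counted once
    # so the ratio below stays finite.
    if superset_freq == 0:
        superset_freq = 1
    fg = subset_freq / subset_size      # foreground frequency
    bg = superset_freq / superset_size  # background frequency
    if fg <= bg:
        return 0.0
    return (fg - bg) * (fg / bg)


# Values from the response: foreground doc_count 1522, background bg_count 1445178,
# and a per-bucket bg_count of 0.
buckets = {"chatgpt": 222, "ai": 93, "chatbot": 23}
scores = {term: jlh_score(count, 1522, 0, 1445178) for term, count in buckets.items()}

for term, score in scores.items():
    print(term, score)
```

If these assumptions hold, a bg_count stuck at 0 makes the score grow roughly with the square of doc_count, so the ranking would match a plain terms aggregation.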
Here's a simplified version of my index mapping:
{
  "properties": {
    "my_field": {
      "type": "nested",
      "properties": {
        "txt": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword"
            }
          }
        },
        "pos": {
          "type": "keyword"
        }
      }
    }
  }
}
To provide more clarity, I'm using the significant_terms aggregation as follows, where I'm filtering based on the pos field before performing the aggregation:
{
  "significant_terms_nested": {
    "nested": {
      "path": "my_field"
    },
    "aggs": {
      "pos_filter": {
        "filter": {
          "terms": {
            "my_field.pos": ["noun", "verb", "adj"]
          }
        },
        "aggs": {
          "significant_terms": {
            "significant_terms": {
              "field": "my_field.txt.keyword",
              "size": 50
            }
          }
        }
      }
    }
  }
}
My primary questions are:
- When using significant_terms on a nested field, especially after filtering by a nested field's value (like pos in my case), do I need to specify to Elasticsearch which field to use for the background search? I'd expect the background scan to consider the entire index without any filters applied. If so, how do I ensure this?
- Is it mandatory for a field to be mapped as text for the significant_terms aggregation to work properly? Or is it sufficient if a field is only mapped as a keyword?
Initially, I mapped the txt field with only the keyword type. After running the significant_terms aggregation, I noticed that the terms returned were not as "significant" as I had expected. I wondered whether this was because the field was not mapped as text in addition to keyword, so I added the text mapping hoping to get more relevant results. However, this change made no notable difference to the aggregation results.
Adding the following background_filter also has no effect; bg_count is still 0:
"background_filter": {
  "match_all": {}
}
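For reference, here is a sketch of where that filter sits in the request, assuming background_filter is placed inside the significant_terms body next to field and size. It is built as a Python dict only so it can be printed as JSON; the variable names are hypothetical.

```python
import json

# Inner aggregation body; background_filter sits next to "field" and "size"
# (assumed placement).
significant_terms_body = {
    "field": "my_field.txt.keyword",
    "size": 50,
    "background_filter": {"match_all": {}},
}

# Same nesting as the aggregation above: nested -> filter -> significant_terms.
aggs = {
    "significant_terms_nested": {
        "nested": {"path": "my_field"},
        "aggs": {
            "pos_filter": {
                "filter": {"terms": {"my_field.pos": ["noun", "verb", "adj"]}},
                "aggs": {
                    "significant_terms": {
                        "significant_terms": significant_terms_body
                    }
                },
            }
        },
    }
}

print(json.dumps(aggs, indent=2))
```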
Steps to reproduce:
[
  {
    "txt": "word1",
    "pos": "POS_TYPE"
  },
  {
    "txt": "word2",
    "pos": "POS_TYPE"
  },
  ...
  {
    "txt": "wordN",
    "pos": "POS_TYPE"
  }
]
- Index documents with the my_field field structured as shown above.
- Apply the significant_terms aggregation using nested and filtered queries on the my_field field.
- Observe the bg_count in the aggregation results.
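The indexing step could be scripted roughly as follows. This is a minimal sketch that builds a newline-delimited body for the _bulk API. The index name my_index, the helper build_bulk_payload, and the sample words and pos values are placeholders, not taken from the real data.

```python
import json


def build_bulk_payload(index_name, docs):
    """Return an NDJSON body: one action line plus one source line per document."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(doc))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical sample data shaped like the my_field structure above.
docs = [
    {"my_field": [{"txt": "word1", "pos": "noun"}, {"txt": "word2", "pos": "verb"}]},
    {"my_field": [{"txt": "word1", "pos": "noun"}, {"txt": "word3", "pos": "adj"}]},
]

payload = build_bulk_payload("my_index", docs)
print(payload)
```

The resulting string can be sent as the body of a POST to the _bulk endpoint with the Content-Type header set to application/x-ndjson.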
Any insights or guidance on this would be greatly appreciated. Thanks in advance!