Significant terms aggregation with non tokenized text

Mike · September 26, 2014, 1:57am

I just tried using the significant terms aggregation on two text fields I
have, and noticed that it doesn't seem to work on "non tokenized" fields.
On my keyword-tokenized field, I get 0 for the bg_count, and the output looks
the same as a regular terms aggregation with slightly different counts. When I
use my regular tokenized field, the results differ and I do get bg_counts.
Why is this?

Here are my 2 fields and analyzer:

"properties" : {
    "query" : {
        "type" : "multi_field",
        "fields" : {
            "query"          : { "type" : "string" },
            "queryUntouched" : { "type" : "string", "analyzer" : "myLowercaseAnalyzer" }
        }
    }
}

"analyzer" : {
    "myLowercaseAnalyzer" : {
        "tokenizer" : "keyword",
        "filter" : ["lowercase"]
    }
}

When I send the significant terms aggregation against queryUntouched it
looks the same as a regular terms agg, with bg_count set to 0:

"aggs": {
    "pop": {
        "terms": {
            "field": "queryUntouched",
            "size": 3
        }
    },
    "sig": {
        "significant_terms": {
            "field": "queryUntouched",
            "size": 3
        }
    }
}

aggregations: {
    pop: {
        buckets: [
            {
                key: yield curve
                doc_count: 102
            }
            {
                key: gdp
                doc_count: 70
            }
        ]
    }
    sig: {
        doc_count: 62804
        buckets: [
            {
                key: yield curve
                doc_count: 102
                score: 7.200895615143776
                bg_count: 0
            }
            {
                key: gdp
                doc_count: 81
                score: 4.540783692447051
                bg_count: 0
            }
        ]
    }
}

When I use the tokenized field, I get results that I would expect:

"aggs": {
    "pop": {
        "terms": {
            "field": "query",
            "size": 2
        }
    },
    "sig": {
        "significant_terms": {
            "field": "query",
            "size": 2
        }
    }
}

aggregations: {
    pop: {
        buckets: [
            {
                key: bank
                doc_count: 1423
            }
            {
                key: of
                doc_count: 641
            }
        ]
    }
    sig: {
        doc_count: 62804
        buckets: [
            {
                key: bank
                doc_count: 1423
                score: 0.03191767117787348
                bg_count: 25686
            }
            {
                key: id
                doc_count: 715
                score: 0.017449718916743313
                bg_count: 12274
            }
        ]
    }
}

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/e7a41870-bb42-46f5-9161-dbeb6c847ad2%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Mark_Harwood_2 · September 26, 2014, 9:48am

Unlike the terms agg, which only accesses the content loaded into RAM (aka
FieldData), the significant_terms agg also has to go to disk to check the
frequency of terms in the index for the background count. This different
data source means the naming conventions can sometimes differ. Can you try
prefixing the field name used by the significant terms agg with "query", e.g.
"field": "query.queryUntouched"?

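For reference, the suggestion above might look like the request body below. This is only a sketch of the change being proposed, not a verified fix: it assumes the sub-field can be addressed by its full path, spelled as in the posted mapping (query.queryUntouched), and it leaves the plain terms agg exactly as it was in the original request.

```json
{
  "aggs": {
    "pop": {
      "terms": {
        "field": "queryUntouched",
        "size": 3
      }
    },
    "sig": {
      "significant_terms": {
        "field": "query.queryUntouched",
        "size": 3
      }
    }
  }
}
```

If the full path resolves against the on-disk index, the sig buckets should come back with non-zero bg_count values; if they are still 0, the naming mismatch is probably not the cause.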
