Significant terms aggregation with non-tokenized text

I just tried the significant terms aggregation on two text fields I have,
and noticed that it doesn't seem to work on "non-tokenized" fields. On my
keyword-tokenized field, I get 0 for the bg_count, and the results look the
same as a regular terms aggregation with slightly different counts. When I
use my regular tokenized "query" field, the results differ from the terms
aggregation and the bg_count values are populated. Why is this?

Here are my 2 fields and analyzer:

"properties": {
    "query": {
        "type": "multi_field",
        "fields": {
            "query":          { "type": "string" },
            "queryUntouched": { "type": "string", "analyzer": "myLowercaseAnalyzer" }
        }
    }
}

"analyzer": {
    "myLowercaseAnalyzer": {
        "tokenizer": "keyword",
        "filter": ["lowercase"]
    }
}

When I send the significant terms aggregation against queryUntouched it
looks the same as a regular terms agg, with bg_count set to 0:

"aggs": {
    "pop": {
        "terms": {
            "field": "queryUntouched",
            "size": 3
        }
    },
    "sig": {
        "significant_terms": {
            "field": "queryUntouched",
            "size": 3
        }
    }
}

"aggregations": {
    "pop": {
        "buckets": [
            { "key": "yield curve", "doc_count": 102 },
            { "key": "gdp", "doc_count": 70 }
        ]
    },
    "sig": {
        "doc_count": 62804,
        "buckets": [
            { "key": "yield curve", "doc_count": 102, "score": 7.200895615143776, "bg_count": 0 },
            { "key": "gdp", "doc_count": 81, "score": 4.540783692447051, "bg_count": 0 }
        ]
    }
}

When I use the tokenized field, I get results that I would expect:

"aggs": {
    "pop": {
        "terms": {
            "field": "query",
            "size": 2
        }
    },
    "sig": {
        "significant_terms": {
            "field": "query",
            "size": 2
        }
    }
}

"aggregations": {
    "pop": {
        "buckets": [
            { "key": "bank", "doc_count": 1423 },
            { "key": "of", "doc_count": 641 }
        ]
    },
    "sig": {
        "doc_count": 62804,
        "buckets": [
            { "key": "bank", "doc_count": 1423, "score": 0.03191767117787348, "bg_count": 25686 },
            { "key": "id", "doc_count": 715, "score": 0.017449718916743313, "bg_count": 12274 }
        ]
    }
}

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/e7a41870-bb42-46f5-9161-dbeb6c847ad2%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Unlike the terms agg, which only accesses the content loaded into RAM (aka
FieldData), the significant_terms agg also has to go to disk to check the
frequency of terms in the index for the background count. Because the data
source is different, the field naming conventions can sometimes differ. Can
you try prefixing the field name used by the significant terms agg with
"query", e.g. "field": "query.queryUntouched"?

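To make the suggestion concrete, here is a rough sketch of the request body with the full field path. It assumes the multi_field mapping from the original post (parent field "query", sub-field "queryUntouched") and has not been run against that index, so treat it as something to try rather than a confirmed fix. The sub-field name is spelled exactly as in the mapping, since field names are case-sensitive.

```json
{
  "aggs": {
    "pop": {
      "terms": {
        "field": "query.queryUntouched",
        "size": 3
      }
    },
    "sig": {
      "significant_terms": {
        "field": "query.queryUntouched",
        "size": 3
      }
    }
  }
}
```

If the full path resolves for the on-disk lookup, the "sig" buckets should come back with non-zero bg_count values, as they do for the tokenized "query" field.
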
On Friday, September 26, 2014 2:57:10 AM UTC+1, Mike wrote:
