Sort aggregation on word frequency count instead of doc_count


#1

Hello,

Is there a way to sort aggregation buckets based on word count instead of the default doc_count?
For example, a color field might contain:
"red blue green table"
"red red red red table"
"blue yellow table"
"blue table"

After aggregating on each of the terms, ES will return the values based on doc_count in the following order:
blue (3), red (2), green (1), yellow (1)

What I want is to count the number of occurrences within the color field to get the following:
red (5), blue (3), green (1), yellow (1)
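
To make the difference between the two counts concrete, here is a minimal Python sketch (plain Python, not Elasticsearch code) that computes both for the four example values above. The variable names are illustrative assumptions, and whitespace splitting stands in for whatever analyzer the field actually uses.

```python
from collections import Counter

# The four example values of the color field from the question.
docs = [
    "red blue green table",
    "red red red red table",
    "blue yellow table",
    "blue table",
]

doc_count = Counter()   # number of documents that contain each word
word_count = Counter()  # total occurrences of each word across all documents

for text in docs:
    # Drop "table", mirroring the "exclude" setting in the query.
    words = [w for w in text.split() if w != "table"]
    doc_count.update(set(words))  # each word counted at most once per document
    word_count.update(words)      # every occurrence counted

print(dict(doc_count))   # blue: 3, red: 2, green: 1, yellow: 1
print(dict(word_count))  # red: 5, blue: 3, green: 1, yellow: 1
```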

This is my current query that produces the first example:

GET myindex/_search
{
  "query": {
    "query_string": {
      "default_field": "color",
      "query": "*table*"
    }
  },
  "aggs": {
    "clusters": {
      "terms": {
        "field": "color",
        "exclude": ".*table.*",
        "size": 10
      }
    }
  },
  "size": 0
}

Edit: I've noticed that the significant_terms aggregation returns additional information such as bg_count (which is the number of term occurrences, if I understand correctly). Can buckets be sorted by bg_count?


(Mark Harwood) #2

No. That is a document count, not a word-usage count. To achieve what you are after, you may need to use a scripted metric aggregation.


#3

How is bg_count calculated? Can it be used for sorting buckets?


(Mark Harwood) #4

How bg_count is calculated is described in the docs.

No, it can't be used for sorting, and it wouldn't help solve your particular problem even if you could.


#5

Do you have any tips for the scripted metric aggregation?
Either I'm still too inexperienced to do this or it cannot be done in Elasticsearch. It looks to me like there is no way to implement it.


(Mark Harwood) #6

It's essentially a coding challenge: writing custom code in the Painless language to gather and then merge collections of data and counts.
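
As a rough illustration of the "gather then merge" shape of that work, the sketch below mimics the four phases of a scripted metric aggregation (init, map, combine, reduce) in plain Python. It is not Painless and it does not call Elasticsearch; the function names and the split of the example documents across two shards are assumptions made up for the illustration.

```python
from collections import Counter

def init_state():
    # init phase: each shard starts with its own empty state
    return Counter()

def map_doc(state, text, excluded=("table",)):
    # map phase: runs once per matching document on a shard
    for word in text.split():
        if word not in excluded:
            state[word] += 1

def combine(state):
    # combine phase: runs once per shard and returns the partial result
    return dict(state)

def reduce_states(states):
    # reduce phase: merge the per-shard partial counts, then sort by
    # total occurrences (descending), breaking ties alphabetically
    total = Counter()
    for partial in states:
        total.update(partial)
    return sorted(total.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical distribution of the example documents over two shards.
shards = [
    ["red blue green table", "red red red red table"],
    ["blue yellow table", "blue table"],
]

partials = []
for shard_docs in shards:
    state = init_state()
    for text in shard_docs:
        map_doc(state, text)
    partials.append(combine(state))

result = reduce_states(partials)
print(result)  # [('red', 5), ('blue', 3), ('green', 1), ('yellow', 1)]
```

A real implementation would have to express the same steps as Painless scripts and decide how to read and tokenize the field value, which this sketch sidesteps with a simple whitespace split.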

Before diving down that rabbit hole - what is the business problem you're trying to solve?


#7

I basically need to rank words by their number of occurrences to support a feature on our platform.


(Mark Harwood) #8

Generally speaking, document frequency (DF) counts are adequate for this sort of ranking, as opposed to summing the term frequencies (TF) within all docs. If a term is likely to occur many times in a doc, it follows that it will be more likely to occur in many docs.
For example - the word "the" appears twice in the paragraph above and also appears in many docs on this site. We don't need TF to figure that out. Summing TF also means you can be skewed by outliers like the spam doc that mentions #BuyCheapMeds a million times.
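
The outlier effect can be seen in a small Python sketch. The corpus below is entirely made up for illustration: 1,000 ordinary documents plus one spam document that repeats a single tag a million times.

```python
from collections import Counter

# Hypothetical corpus: ordinary docs plus one spam doc that repeats a tag.
docs = (
    ["the quick fox"] * 500
    + ["the lazy dog"] * 500
    + ["#BuyCheapMeds " * 1_000_000]
)

df = Counter()  # document frequency: docs containing the term
tf = Counter()  # summed term frequency: total occurrences across docs

for text in docs:
    words = text.split()
    df.update(set(words))
    tf.update(words)

top_by_df = df.most_common(1)[0]
top_by_tf = tf.most_common(1)[0]

print(top_by_df)  # ('the', 1000)
print(top_by_tf)  # ('#BuyCheapMeds', 1000000)
```

Ranking by DF keeps "the" on top, while ranking by summed TF lets the single spam document push its tag to first place.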


#9

Alright, that makes sense. One less problem for me, thank you!

