Is there a way to sort aggregation buckets based on word count instead of the default doc_count?
For example, a color field might contain:
"red blue green table"
"red red red red table"
"blue yellow table"
"blue table"
After aggregating on each of the terms, ES will return the values based on doc_count in the following order:
blue (3), red (2), green (1), yellow (1)
What I want is to count the number of occurrences within the color field to get the following:
red (5), blue (3), green (1), yellow (1)
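To make the difference between the two counts concrete, here is a minimal Python sketch that reproduces both orderings outside Elasticsearch. It assumes plain whitespace tokenization, which may not match the analyzer on the real field, and the helper names `doc_counts` and `occurrence_counts` are made up for this illustration. "table" is filtered out only to mirror the listings above.

```python
from collections import Counter

# Sample values of the color field, copied from the example above.
docs = [
    "red blue green table",
    "red red red red table",
    "blue yellow table",
    "blue table",
]

def doc_counts(values):
    """Count how many documents contain each term (what doc_count reports)."""
    counts = Counter()
    for value in values:
        # set() collapses repeats, so each document counts at most once per term.
        counts.update(set(value.split()))
    return counts

def occurrence_counts(values):
    """Count every occurrence of each term across all documents."""
    counts = Counter()
    for value in values:
        counts.update(value.split())
    return counts

df = doc_counts(docs)
tf = occurrence_counts(docs)

colors = ["red", "blue", "green", "yellow"]  # "table" left out, as in the listings

# Sort by count descending, then alphabetically to break ties.
by_doc_count = sorted(colors, key=lambda c: (-df[c], c))
by_occurrences = sorted(colors, key=lambda c: (-tf[c], c))

print([(c, df[c]) for c in by_doc_count])
print([(c, tf[c]) for c in by_occurrences])
```

The only difference between the two helpers is the `set()` call, which is exactly the difference between counting documents and counting occurrences.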
This is my current query that produces the first example:
Edit: I've noticed that the significant_terms aggregation returns additional info, such as bg_count (which is the number of term occurrences, if I understand it correctly?). Can buckets be sorted by bg_count?
Do you have any tips for the scripted metric aggregation?
Either I'm still too inexperienced to do this, or it cannot be done in Elasticsearch. It looks to me like there is no way to implement this.
Generally speaking, the doc frequency counts (DF) are adequate for this sort of ranking, as opposed to summing the term frequency usages within all docs (TF). If a term is likely to occur many times in a doc, it follows that it will be more likely to occur in many docs.
For example, the word "the" appears twice in the paragraph above and also appears in many docs on this site. We don't need TF to figure that out. Summing TF also means you can be skewed by outliers like the spam doc that mentions #BuyCheapMeds a million times.
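The outlier problem can be shown with a small Python sketch. The corpus below is invented purely for illustration (the spam doc repeats its hashtag 1000 times instead of a million to keep it small), and whitespace tokenization is an assumption.

```python
from collections import Counter

# Hypothetical corpus: three ordinary docs plus one spam doc.
corpus = [
    "cheap flights today",
    "flights to rome",
    "rome hotels",
    " ".join(["#BuyCheapMeds"] * 1000),  # spam doc repeating a single term
]

summed_tf = Counter()  # total occurrences of each term across all docs
doc_freq = Counter()   # number of docs containing each term

for text in corpus:
    tokens = text.split()
    summed_tf.update(tokens)
    doc_freq.update(set(tokens))  # each doc counts once per term

top_by_tf = summed_tf.most_common(1)[0]
print("top term by summed TF:", top_by_tf)
print("doc frequency of that term:", doc_freq[top_by_tf[0]])
print("highest doc frequency in corpus:", max(doc_freq.values()))
```

With summed TF the spam term dominates the ranking even though it appears in a single doc, while by DF it ranks no higher than any other term seen in one doc.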