Thanks Mark!
I've been planning to look into significant_terms, but didn't know it
could help me with this. I'm a bit concerned that overly clever scoring
could be hard to explain to users, but I'll give it a shot.
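
One way to keep the results explainable is to show users a plain percentage next to each bucket, computed from the doc_count and bg_count fields described in the quoted message below. The following is only a sketch under assumptions: the sample buckets are invented, and the helper name add_percent is hypothetical. It also shows why sorting on the percentage alone can favour tiny samples.

```python
def add_percent(buckets):
    """Attach doc_count as a percentage of bg_count to each bucket."""
    rows = []
    for b in buckets:
        bg = b["bg_count"]
        # Guard against an empty background set to avoid dividing by zero.
        percent = 100.0 * b["doc_count"] / bg if bg else 0.0
        rows.append({
            "key": b["key"],
            "doc_count": b["doc_count"],
            "bg_count": bg,
            "percent": percent,
        })
    return rows


# Invented sample buckets for illustration only, not real output.
sample_buckets = [
    {"key": "UK", "doc_count": 25, "bg_count": 50},
    {"key": "USA", "doc_count": 10, "bg_count": 100},
    {"key": "Monaco", "doc_count": 2, "bg_count": 2},
]

rows = add_percent(sample_buckets)

# Sorting purely by percentage puts the two-document country first.
by_percent = sorted(rows, key=lambda r: r["percent"], reverse=True)
for r in by_percent:
    print(f'{r["key"]}: {r["doc_count"]} of {r["bg_count"]} docs ({r["percent"]:.0f}%)')
```

Displaying the percentage together with the raw counts lets a reader see at a glance when a high percentage rests on very few documents.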
On Tue, Feb 17, 2015 at 9:41 AM, Mark Harwood <
mark.harwood@elasticsearch.com> wrote:
Nice to see someone taking the trouble to put their stats in context.
Drives me nuts every time I see the equivalent of this:
xkcd: Heatmap

So we have a feature that does some of what you are after - it's called
the "significant_terms" aggregation.
Your query would look like this:

{
  "query": {
    "match": {
      "text": "foo"
    }
  },
  "aggs": {
    "keywords": {
      "significant_terms": {
        "field": "country",
        "size": 100
      }
    }
  }
}

What you get back are buckets for each country with a doc_count that
represents how many "foo" documents there were in that country and a
background count called "bg_count" which is how many docs (foo and non foo)
came from that country. Selections are ranked using a score that is
returned and which is more nuanced than a straight doc_count/bg_count
percentage. In practice we find prioritizing selections solely by a
percentage measure can skew results towards very rare terms (in your case very
small countries) that have few data samples and so can more easily achieve
high-scoring percentages. Instead, we offer a variety of scoring heuristics
which place a different emphasis on popular vs rare when it comes to
ranking (see https://twitter.com/elasticmark/status/513320986956292096).

Cheers
Mark

On Tuesday, February 17, 2015 at 1:07:31 AM UTC, ja...@holderdeord.no
wrote:

Hi,
I'm looking for a way to have Elasticsearch calculate the percentage of
docs that match a query within a terms aggregation.
That is, given two aggregations where one is filtered and the other is
not:

{
  aggregations: {
    countries: {
      filter: {
        query: {
          query_string: {
            default_field: "description",
            query: "foo"
          }
        }
      },
      aggregations: {
        filteredCountries: {
          terms: { field: "country" }
        }
      }
    },
    totalCountries: {
      terms: { field: "country" }
    }
  },
  size: 0
}

Let's say the totalCountries buckets are:
"buckets": [ { "key": "USA", "doc_count": 100 }, { "key": "UK", "doc_count": 50 } ]
and the filteredCountries buckets are:
"buckets": [ { "key": "USA", "doc_count": 10 }, { "key": "UK", "doc_count": 25 } ]
Is there a way to get a response that returns filteredCountries as
percentages of totalCountries? I.e. something like:

[
  {
    "key": "USA",
    "percent": 10
  },
  {
    "key": "UK",
    "percent": 50
  }
]

Thanks!
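
One client-side option is to join the two bucket lists on their key and divide. Below is a minimal sketch in Python, assuming the two "buckets" arrays have already been pulled out of the response as shown above. The function name percent_by_key is made up for illustration and is not part of any Elasticsearch client.

```python
def percent_by_key(filtered_buckets, total_buckets):
    """Express each filtered bucket's doc_count as a percentage of the
    total bucket that has the same key."""
    totals = {b["key"]: b["doc_count"] for b in total_buckets}
    result = []
    for bucket in filtered_buckets:
        total = totals.get(bucket["key"], 0)
        if total == 0:
            # Key absent from the totals list (or empty), so no percentage
            # can be computed for it.
            continue
        result.append({
            "key": bucket["key"],
            "percent": 100.0 * bucket["doc_count"] / total,
        })
    return result


# The example buckets from the message above.
total_countries = [{"key": "USA", "doc_count": 100}, {"key": "UK", "doc_count": 50}]
filtered_countries = [{"key": "USA", "doc_count": 10}, {"key": "UK", "doc_count": 25}]

percentages = percent_by_key(filtered_countries, total_countries)
print(percentages)
# [{'key': 'USA', 'percent': 10.0}, {'key': 'UK', 'percent': 50.0}]
```

A filtered key that is missing from the totals list is skipped rather than reported as 0%, since its real total is unknown in that case.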
--
You received this message because you are subscribed to a topic in the
Google Groups "elasticsearch" group.
To unsubscribe from this topic, visit
https://groups.google.com/d/topic/elasticsearch/1ojltqSRdhA/unsubscribe.
To unsubscribe from this group and all its topics, send an email to
elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/5337cd90-a434-4a44-9a81-969e55568389%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.