Combining two aggregations to get term percentage

Thanks Mark!

I've been planning to look into significant_terms, but didn't know it
could help me with this. I'm a bit concerned that an overly clever scoring
scheme could be hard to explain to users, but I'll give it a shot.

On Tue, Feb 17, 2015 at 9:41 AM, Mark Harwood <mark.harwood@elasticsearch.com> wrote:

Nice to see someone taking the trouble to put their stats in context.
Drives me nuts every time I see the equivalent of this:
xkcd: Heatmap

So we have a feature that does some of what you are after - it's called
the "significant_terms" aggregation.
Your query would look like this:
{
    "query": {
        "match": {
            "text": "foo"
        }
    },
    "aggs": {
        "keywords": {
            "significant_terms": {
                "field": "country",
                "size": 100
            }
        }
    }
}

What you get back are buckets for each country with a doc_count that
represents how many "foo" documents there were in that country and a
background count called "bg_count", which is how many docs (foo and non-foo)
came from that country. Buckets are ranked by a score, returned with each
bucket, that is more nuanced than a straight doc_count/bg_count
percentage. In practice we find that prioritizing selections solely by a
percentage measure can skew results towards very rare terms (in your case,
very small countries) that have few data samples and so can more easily
achieve high percentages. Instead, we offer a variety of scoring heuristics
which place a different emphasis on popular vs. rare terms when it comes to
ranking (see https://twitter.com/elasticmark/status/513320986956292096).
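
If you still want the plain percentage next to the score, it can be derived client-side from doc_count and bg_count. Below is a minimal Python sketch of that, assuming the response has already been parsed and the bucket list pulled out. The function name is hypothetical and the sample buckets (including the tiny "Tuvalu" one) are made-up numbers, included only to show how a percentage-only ranking favours terms with very few samples:

```python
def bucket_percentages(buckets):
    """Return (key, percent, bg_count) tuples, percent = doc_count / bg_count * 100."""
    rows = []
    for b in buckets:
        bg = b["bg_count"]
        # Guard against a zero background count to avoid dividing by zero.
        pct = 100.0 * b["doc_count"] / bg if bg else 0.0
        rows.append((b["key"], pct, bg))
    return rows


# Hypothetical buckets carrying the doc_count and bg_count fields described above.
sample_buckets = [
    {"key": "USA", "doc_count": 10, "bg_count": 100},
    {"key": "UK", "doc_count": 25, "bg_count": 50},
    {"key": "Tuvalu", "doc_count": 1, "bg_count": 1},
]

# Rank by raw percentage only: the single-document country comes out on top.
by_percent = sorted(bucket_percentages(sample_buckets), key=lambda r: r[1], reverse=True)
for key, pct, bg in by_percent:
    print(f"{key}: {pct:.0f}% of {bg} docs")
```

With these made-up numbers the one-document bucket ranks first at 100%, which is the kind of skew the scoring heuristics are meant to dampen.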

Cheers
Mark

On Tuesday, February 17, 2015 at 1:07:31 AM UTC, ja...@holderdeord.no wrote:

Hi,

I'm looking for a way to have Elasticsearch calculate the percentage of
docs that match a query within a terms aggregation.
That is, given two aggregations where one is filtered and the other is
not:

{
    "aggregations": {
        "countries": {
            "filter": {
                "query": {
                    "query_string": {
                        "default_field": "description",
                        "query": "foo"
                    }
                }
            },
            "aggregations": {
                "filteredCountries": {
                    "terms": { "field": "country" }
                }
            }
        },
        "totalCountries": {
            "terms": { "field": "country" }
        }
    },
    "size": 0
}

Let's say the totalCountries buckets are:

"buckets": [
    {
        "key": "USA",
        "doc_count": 100
    },
    {
        "key": "UK",
        "doc_count": 50
    }
]

and the filteredCountries buckets are:

"buckets": [
    {
        "key": "USA",
        "doc_count": 10
    },
    {
        "key": "UK",
        "doc_count": 25
    }
]

Is there a way to get a response that returns filteredCountries as
percentages of totalCountries? I.e. something like:

[
    {
        "key": "USA",
        "percent": 10
    },
    {
        "key": "UK",
        "percent": 50
    }
]
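
For comparison, the client-side version of this is a small join on the bucket keys. A rough Python sketch, assuming both bucket lists have already been extracted from the parsed response (the function name is hypothetical; the numbers are the example buckets above):

```python
def percent_of_total(filtered_buckets, total_buckets):
    """Express each filtered bucket's doc_count as a percentage of the matching total bucket."""
    totals = {b["key"]: b["doc_count"] for b in total_buckets}
    result = []
    for b in filtered_buckets:
        total = totals.get(b["key"], 0)
        # Skip keys with no matching (or empty) total bucket rather than dividing by zero.
        if total:
            result.append({"key": b["key"], "percent": 100 * b["doc_count"] / total})
    return result


total_countries = [
    {"key": "USA", "doc_count": 100},
    {"key": "UK", "doc_count": 50},
]
filtered_countries = [
    {"key": "USA", "doc_count": 10},
    {"key": "UK", "doc_count": 25},
]

print(percent_of_total(filtered_countries, total_countries))
```

One caveat with this approach: a terms aggregation only returns its top buckets, so a country can show up in filteredCountries without appearing in totalCountries. The sketch silently drops those keys.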

Thanks!

--
You received this message because you are subscribed to a topic in the
Google Groups "elasticsearch" group.
To unsubscribe from this topic, visit
https://groups.google.com/d/topic/elasticsearch/1ojltqSRdhA/unsubscribe.
To unsubscribe from this group and all its topics, send an email to
elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/5337cd90-a434-4a44-9a81-969e55568389%40googlegroups.com

For more options, visit https://groups.google.com/d/optout.
