Combining two aggregations to get term percentage

jarib · February 17, 2015, 1:07am

Hi,

I'm looking for a way to have Elasticsearch calculate the percentage of
docs that match a query within a terms aggregation.
That is, given two aggregations where one is filtered and the other is not:

{
    "aggregations": {
        "countries": {
            "filter": {
                "query": {
                    "query_string": {
                        "default_field": "description",
                        "query": "foo"
                    }
                }
            },
            "aggregations": {
                "filteredCountries": {
                    "terms": { "field": "country" }
                }
            }
        },
        "totalCountries": {
            "terms": { "field": "country" }
        }
    },
    "size": 0
}

Let's say the totalCountries buckets are:

"buckets": [
    {
        "key": "USA",
        "doc_count": 100
    },
    {
        "key": "UK",
        "doc_count": 50
    }
]

and the filteredCountries buckets are:

"buckets": [
    {
        "key": "USA",
        "doc_count": 10
    },
    {
        "key": "UK",
        "doc_count": 25
    }
]

Is there a way to get a response that returns filteredCountries as
percentages of totalCountries? I.e. something like:

[
    {
        "key": "USA",
        "percent": 10
    },
    {
        "key": "UK",
        "percent": 50
    }
]

Thanks!

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/8bbdff97-e2a0-415e-ba4f-f418a279be27%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Mark_Harwood_2 · February 17, 2015, 9:41am

Nice to see someone taking the trouble to put their stats in context.
Drives me nuts every time I see the equivalent of this:
xkcd: Heatmap

So we have a feature that does some of what you are after - it's called the
"significant_terms" aggregation.
Your query would look like this:
{
    "query": {
        "match": {
            "text": "foo"
        }
    },
    "aggs": {
        "keywords": {
            "significant_terms": {
                "field": "country",
                "size": 100
            }
        }
    }
}

What you get back are buckets for each country with a doc_count that
represents how many "foo" documents there were in that country and a
background count called "bg_count", which is how many docs (foo and non-foo)
came from that country. Buckets are ranked by a returned score that is
more nuanced than a straight doc_count/bg_count percentage. In practice we
find that prioritizing selections solely by a percentage measure can skew
results towards very rare terms (in your case, very small countries) that
have few data samples and so can more easily achieve high percentages.
Instead, we offer a variety of scoring heuristics which place a different
emphasis on popular vs. rare terms when it comes to ranking (see
https://twitter.com/elasticmark/status/513320986956292096).
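
To illustrate the skew, here is a small Python sketch that ranks buckets by the raw doc_count/bg_count percentage. The first two buckets use the numbers from the question; the third is a made-up tiny country with only two documents, added purely as an assumption for illustration.

```python
# Buckets shaped like significant_terms output (doc_count and bg_count only).
buckets = [
    {"key": "USA", "doc_count": 10, "bg_count": 100},
    {"key": "UK", "doc_count": 25, "bg_count": 50},
    # Hypothetical rare term: 2 documents in total, both matching "foo".
    {"key": "Tuvalu", "doc_count": 2, "bg_count": 2},
]


def percent(bucket):
    """Share of a country's documents that matched the query, in percent."""
    return 100.0 * bucket["doc_count"] / bucket["bg_count"]


# Rank purely by percentage, highest first.
ranked = sorted(buckets, key=percent, reverse=True)
ranked_keys = [b["key"] for b in ranked]
print(ranked_keys)
```

With only two samples, the rare term reaches 100% and ends up ahead of countries backed by far more data, which is the effect the scoring heuristics try to dampen.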

Cheers
Mark

jarib · February 17, 2015, 10:43am

Thanks Mark!

I've been planning to look into significant_terms, but didn't know it
could help me with this. I'm a bit concerned that an overly clever scoring
scheme could be hard to explain to users, but I'll give it a shot.

Mark_Harwood_2 · February 17, 2015, 10:52am

You can choose to ignore the score and compute your own by dividing
doc_count by bg_count.
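
A minimal sketch of that division in Python, assuming the search response has already been parsed into a dict and the aggregation is named "keywords" as in the query above. The helper name bucket_percentages is hypothetical, and the sample response is trimmed to the fields used here (real buckets also carry a score); its counts are the example numbers from the original question.

```python
def bucket_percentages(response, agg_name="keywords"):
    """Return doc_count as a percentage of bg_count for each bucket."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return [
        {"key": b["key"], "percent": 100.0 * b["doc_count"] / b["bg_count"]}
        for b in buckets
        if b["bg_count"] > 0  # guard against division by zero
    ]


# Trimmed, hand-written sample in the shape of a significant_terms response.
sample_response = {
    "aggregations": {
        "keywords": {
            "buckets": [
                {"key": "USA", "doc_count": 10, "bg_count": 100},
                {"key": "UK", "doc_count": 25, "bg_count": 50},
            ]
        }
    }
}

percentages = bucket_percentages(sample_response)
print(percentages)
```

Note that the buckets still arrive ordered by the aggregation's own score, so any re-sorting by percent only applies to the buckets that were returned.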

Your post has made me think we should add this more easily explainable
metric as one of the scoring heuristics we offer for this aggregation.

jarib · February 17, 2015, 12:11pm

Yes!

If I have to do the division on my own I might as well stick with the two
aggregations, AFAICT.
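
For reference, a minimal Python sketch of doing that division client-side with the two terms aggregations. The function name term_percentages is hypothetical, and the bucket lists are the example numbers from the first post. It assumes both aggregations bucket on the same field and that the totals list contains every key of interest (that is, it was not cut off by size).

```python
def term_percentages(filtered_buckets, total_buckets):
    """Express each filtered bucket's doc_count as a percentage of its total."""
    totals = {b["key"]: b["doc_count"] for b in total_buckets}
    result = []
    for bucket in filtered_buckets:
        total = totals.get(bucket["key"])
        if not total:
            # Key missing from the totals (or zero): skip rather than guess.
            continue
        result.append(
            {"key": bucket["key"], "percent": 100.0 * bucket["doc_count"] / total}
        )
    return result


total_countries = [
    {"key": "USA", "doc_count": 100},
    {"key": "UK", "doc_count": 50},
]
filtered_countries = [
    {"key": "USA", "doc_count": 10},
    {"key": "UK", "doc_count": 25},
]

percentages = term_percentages(filtered_countries, total_countries)
print(percentages)
```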

But if it were available as a scoring heuristic, I could effectively use
{size: N} so I wouldn't have to fetch the full set of countries to do this
calculation.

I’ve opened a feature request here
https://github.com/elasticsearch/elasticsearch/issues/9720.
