Accuracy on cardinality aggregate

Henrik_Nordvik · March 28, 2014, 7:17am

Hi,
I'm trying out the new cardinality aggregation, and want to measure the
accuracy on my data. I'm using a dataset of a day of sample tweets (2.8m
tweets).

I'm counting the number of unique usernames per language.
To get my "reference" unique count I use this:
GET /twitter-2014.03.26/_search
{
  "size": 0,
  "aggs": {
    "country_count": {
      "terms": {
        "field": "lang"
      },
      "aggs": {
        "unique_count": { "value_count": { "field": "screen_name" } }
      }
    }
  }
}

Result:
"aggregations": {
  "country_count": {
    "buckets": [
      {
        "key": "en",
        "doc_count": 872906,
        "unique_count": {
          "value": 307489
        }
      },
      {
        "key": "ja",
        "doc_count": 581521,
        "unique_count": {
          "value": 103035
        }
      },

To get the approximate count with cardinality:
GET /twitter-2014.03.26/_search
{
  "size": 0,
  "aggs": {
    "country_count": {
      "terms": {
        "field": "lang"
      },
      "aggregations": {
        "distinct_users_approx": {
          "cardinality": {
            "field": "screen_name",
            "precision_threshold": 40000
          }
        }
      }
    }
  }
}

Result:
"aggregations": {
  "country_count": {
    "buckets": [
      {
        "key": "en",
        "doc_count": 872906,
        "distinct_users_approx": {
          "value": 145541
        }
      },
      {
        "key": "ja",
        "doc_count": 581521,
        "distinct_users_approx": {
          "value": 50824
        }
      },

So, 307489 vs 145541 for English, and 103035 vs 50824 for Japanese. Not
very accurate.

Am I computing the reference distinct count correctly?
Is it supposed to be this inaccurate on this type of dataset?
Is there any way to improve precision?

Henrik

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/91eead45-319c-4a72-81a9-bad214a3ee61%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Mark_Harwood_2 · March 28, 2014, 5:54pm

I don't believe value_count is intended to be a unique count.


Henrik_Nordvik · March 28, 2014, 8:38pm

I compared the unique count with the total field of the old terms facet and
it matched. What else would the count be? It is lower than doc count.

Binh_Ly_2 · March 28, 2014, 9:32pm

value_count is the total number of values extracted per bucket. This
example might help:

https://gist.github.com/bly2k/9843335

curl -XDELETE localhost:9200/test
curl -XPUT localhost:9200/test/doc/1 -d '{ "a": "1" }'
curl -XPUT localhost:9200/test/doc/2 -d '{ "a": "1" }'
curl -XPUT localhost:9200/test/doc/3 -d '{ "a": "1" }'
curl -XPOST "localhost:9200/test/_search?search_type=count&pretty" -d '{
  "aggs": {

(The gist is truncated here.)
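To make the distinction concrete without a cluster, here is a minimal pure-Python sketch of what the two aggregations report per bucket. The sample documents and the helper name `per_bucket_counts` are made up for illustration. It only mimics the semantics described above (value_count counts every extracted value, duplicates included), not Elasticsearch's implementation, and it counts distinct values exactly, whereas cardinality is approximate.

```python
from collections import defaultdict

# Hypothetical sample documents, loosely modelled on the tweets in this thread.
DOCS = [
    {"lang": "en", "screen_name": "alice"},
    {"lang": "en", "screen_name": "alice"},  # same user tweeting twice
    {"lang": "en", "screen_name": "bob"},
    {"lang": "ja", "screen_name": "carol"},
    {"lang": "ja"},                          # document without a screen_name
]


def per_bucket_counts(docs, bucket_field, value_field):
    """Group docs by bucket_field and count values of value_field per bucket."""
    buckets = defaultdict(lambda: {"doc_count": 0, "value_count": 0, "seen": set()})
    for doc in docs:
        if bucket_field not in doc:
            continue
        bucket = buckets[doc[bucket_field]]
        bucket["doc_count"] += 1
        if value_field in doc:
            bucket["value_count"] += 1             # every value counts, duplicates too
            bucket["seen"].add(doc[value_field])   # a set keeps each value once
    return {
        key: {
            "doc_count": b["doc_count"],
            "value_count": b["value_count"],
            "distinct": len(b["seen"]),
        }
        for key, b in buckets.items()
    }


if __name__ == "__main__":
    for key, counts in per_bucket_counts(DOCS, "lang", "screen_name").items():
        print(key, counts)
```

In this toy data the "en" bucket has three values but only two distinct users, which is the kind of gap between value_count and a unique count discussed in this thread.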


Henrik_Nordvik · March 31, 2014, 9:09pm

Ah, so there is currently no easy way of getting exact unique counts out
of Elasticsearch?

I found a manual way of doing it:

curl -s 'http://localhost:9200/twitter-2014.03.26/_search' -d '{
  "facets": {
    "a": {
      "terms": { "field": "screen_name", "size": 200000 },
      "facet_filter": { "query": { "term": { "lang": "en" } } }
    }
  },
  "size": 0
}' | ./jq '.facets.a.terms | length'
145474 (vs 145541)

curl -s 'http://localhost:9200/twitter-2014.03.26/_search' -d '{
  "facets": {
    "a": {
      "terms": { "field": "screen_name", "size": 200000 },
      "facet_filter": { "query": { "term": { "lang": "ja" } } }
    }
  },
  "size": 0
}' | ./jq '.facets.a.terms | length'
50949 (vs 50824)

So the count is quite close! Thank you.
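As a quick check of "quite close", the relative error of the cardinality result against the facet-based count can be computed from the four numbers above. This is a small sketch; the function name `relative_error` is just for illustration, and it assumes the facet with size 200000 returned every distinct term, so that its length is the exact count.

```python
# Counts reported in this thread:
# (exact count via the terms facet, approximate count via cardinality).
COUNTS = {
    "en": (145474, 145541),
    "ja": (50949, 50824),
}


def relative_error(exact, approx):
    """Signed relative error of approx with respect to exact."""
    return (approx - exact) / exact


if __name__ == "__main__":
    for lang, (exact, approx) in COUNTS.items():
        print(f"{lang}: {relative_error(exact, approx):+.3%}")
```

Both errors come out well under 1% in absolute value (roughly 0.05% for "en" and 0.25% for "ja").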


jpountz · April 1, 2014, 7:50am

Hi Henrik,

Indeed, there is no way to compute exact unique counts. The reason we
don't expose such a feature is that it would be very costly. In your case,
the cardinality is not too large, so the terms aggregation could be used to
compute the number of unique values. But if the actual cardinality had been
very large (e.g. 100M), using the terms agg this way would very likely have
required a lot of memory (possibly triggering out-of-memory errors on your
nodes), been very slow, and caused a lot of network traffic.
We will try to clarify this through documentation or a blog post soon.

Thanks for trying out this new aggregation!

--
Adrien Grand
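The cost argument above can be illustrated with a toy estimator: an exact count has to remember every distinct value, while a HyperLogLog-style sketch keeps a fixed number of small registers no matter how many values it sees. The following is a minimal sketch of plain HyperLogLog, not the HyperLogLog++ variant that the cardinality aggregation is based on. The class name, the choice of p=14, and the use of SHA-1 as the hash are assumptions made for this illustration only.

```python
import hashlib
import math


class HyperLogLog:
    """Toy HyperLogLog estimator with 2**p one-byte registers."""

    def __init__(self, p=14):
        self.p = p
        self.m = 1 << p                      # number of registers
        self.registers = bytearray(self.m)   # fixed memory: m bytes
        # Bias-correction constant from the HyperLogLog paper, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, value):
        # Take 64 bits of a SHA-1 digest as the hash of the value.
        digest = hashlib.sha1(value.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        index = h >> (64 - self.p)                # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)     # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits (1-based).
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[index]:
            self.registers[index] = rank

    def count(self):
        # Raw estimate: normalized harmonic mean of 2**register.
        z = sum(2.0 ** -r for r in self.registers)
        estimate = self.alpha * self.m * self.m / z
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            estimate = self.m * math.log(self.m / zeros)
        return round(estimate)


if __name__ == "__main__":
    hll = HyperLogLog()
    for i in range(100000):
        hll.add(f"user{i}")
    print("estimate:", hll.count(), "registers:", len(hll.registers))
```

Adding the same value again never changes a register, so duplicates do not inflate the estimate, and the memory used stays at 2**p bytes regardless of how many distinct values are added.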


Dror_Atariah · November 25, 2014, 1:29pm

Hi Adrien,

I have two comments/questions:

For me, the documentation is still somewhat confusing, and the difference
between the cardinality and value_count aggregations is not 100% clear.

When it comes to counting unique values, I believe the only option at the
moment is the cardinality aggregation. This, however, comes at the price of
an approximate result (as discussed in the documentation and in the paper
describing HyperLogLog++). I understand the need for an approximate
approach, but I think the returned result should indicate a bound on the
error. Otherwise, the returned count could be considered useless. The
documentation mentions a figure of 5%: is it independent of the
cardinality? What happens to this bound when the precision threshold
is >> 40,000?

Thanks for your time,
Dror


jpountz · November 25, 2014, 5:41pm

Hi Dror,

On Tue, Nov 25, 2014 at 2:29 PM, Dror Atariah drorata@gmail.com wrote:

> For me, the documentation is still somewhat confusing, and the
> difference between the cardinality and value_count aggregations is
> not 100% clear.

I have to agree here... If you have suggestions to make it less confusing,
ideas are highly welcome (even changing the name of the aggs might be an
option if we do it in a major release).

> When it comes to counting unique values: I believe that the only way
> that one can take, at the moment, is to use the cardinality aggregation.
> This, however, comes with the price of an approximated result (as discussed
> in the documentation and in the paper describing HyperLogLog++). I
> understand the need to take an approximating approach; but I think that the
> returned result should indicate a bound on the error. Otherwise, the
> returned count could be considered useless. In the documentation the figure
> 5% is mentioned: is it independent of the cardinality? What happens to
> this bound when the precision threshold is >> 40,000?

This is true: only the cardinality aggregation can compute unique
counts.

The thing about the error is that there is no bound on it, but higher
errors are less likely. The only thing we might be able to return would
be a confidence interval, but it requires some work... Regarding the 5%
figure mentioned in the documentation, it was just meant as an example to
show that in spite of the approximate approach, results are very close to
accurate. A precision_threshold above 40000 is basically the same as a
precision_threshold of 40000.

--
Adrien Grand
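To give a feel for "no bound, but higher errors are less likely": the standard analysis of plain HyperLogLog gives a relative standard error of about 1.04 / sqrt(m) for m registers. The sketch below turns that into a rough interval under a normal approximation. It is a back-of-the-envelope illustration, not something Elasticsearch returns: the function names are made up, the register count of 16384 in the usage example is a hypothetical value, and the HyperLogLog++ implementation behind the cardinality aggregation may behave differently.

```python
import math


def hll_relative_standard_error(num_registers):
    """Approximate relative standard error of plain HyperLogLog."""
    return 1.04 / math.sqrt(num_registers)


def approximate_interval(estimate, num_registers, z=1.96):
    """Rough interval around an estimate, assuming normally distributed error.

    z=1.96 corresponds to about 95% coverage under that assumption.
    This is a probabilistic statement, not a hard bound on the error.
    """
    half_width = z * hll_relative_standard_error(num_registers) * estimate
    return estimate - half_width, estimate + half_width


if __name__ == "__main__":
    # 145541 is the cardinality result for "en" reported earlier in this thread;
    # 16384 registers is a hypothetical sketch size chosen for illustration.
    low, high = approximate_interval(145541, 16384)
    print(f"relative standard error: {hll_relative_standard_error(16384):.4%}")
    print(f"approximate 95% interval: [{low:.0f}, {high:.0f}]")
```

Values outside such an interval remain possible, only increasingly unlikely, which matches the point that the error has no strict bound.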


Dror_Atariah · November 25, 2014, 7:19pm

Thanks for your quick reply!

On Tue, Nov 25, 2014 at 6:41 PM, Adrien Grand <adrien.grand@elasticsearch.com> wrote:

> I have to agree here... If you have suggestions to make it less confusing,
> ideas are highly welcome (even changing the name of the aggs might be an
> option if we do it in a major release).

Well, changing names is problematic because of backward compatibility and
should be a last resort. Before that, I'd suggest adding a section, common
to the two aggregations, with a single (minimal) example that demonstrates
the differences.

> This is true, only the cardinality aggregation allows to compute unique
> counts.
>
> The thing about the error is that there is no bound on it, but higher
> errors are less likely. The only thing we might be able to return would
> be a confidence interval, but it requires some work... Regarding the 5%
> that are mentioned in the documentation, it was just meant as an example to
> show that in spite of the approximate approach, results are very close to
> accurate. A precision_threshold above 40000 is basically the same as a
> precision_threshold of 40000.

If there's no theoretical bound, then I guess the best one can hope for is
the probability that the returned value lies outside an \epsilon interval
around the true count (what you probably refer to as a "confidence
interval"). This would be great, not to say absolutely necessary. When a
data scientist presents his work, the (narrow-minded) business guys want to
know the numbers...

Furthermore, since ES aims at big data, it is not clear to me how one
arrives at the 40,000 figure. After all, whether the number of unique values
is on the order of 1K or 100M, the threshold cannot be the same... can
it?

Best,
Dror

PS: thanks for the interesting discussion!


--
Dror Atariah, Ph.D.
de.linkedin.com/in/atariah

