Accuracy on cardinality aggregate

Hi,
I'm trying out the new cardinality aggregation, and want to measure the
accuracy on my data. I'm using a dataset of a day of sample tweets (2.8m
tweets).

I'm counting the number of unique usernames per language.
To get my "reference" unique count I use this:
GET /twitter-2014.03.26/_search
{
  "size": 0,
  "aggs": {
    "country_count": {
      "terms": {
        "field": "lang"
      },
      "aggs": {
        "unique_count": { "value_count": { "field": "screen_name" } }
      }
    }
  }
}

Result:
"aggregations": {
  "country_count": {
    "buckets": [
      {
        "key": "en",
        "doc_count": 872906,
        "unique_count": {
          "value": 307489
        }
      },
      {
        "key": "ja",
        "doc_count": 581521,
        "unique_count": {
          "value": 103035
        }
      },

To get the approximate count with cardinality:
GET /twitter-2014.03.26/_search
{
  "size": 0,
  "aggs": {
    "country_count": {
      "terms": {
        "field": "lang"
      },
      "aggregations": {
        "distinct_users_approx": {
          "cardinality": {
            "field": "screen_name",
            "precision_threshold": 40000
          }
        }
      }
    }
  }
}

Result:
"aggregations": {
  "country_count": {
    "buckets": [
      {
        "key": "en",
        "doc_count": 872906,
        "distinct_users_approx": {
          "value": 145541
        }
      },
      {
        "key": "ja",
        "doc_count": 581521,
        "distinct_users_approx": {
          "value": 50824
        }
      },

So, 307489 vs 145541 for English, and 103035 vs 50824 for Japanese. Not
very accurate.

  1. Am I computing the reference unique count correctly?
  2. Is it supposed to be this inaccurate on this type of dataset?
  3. Is there any way to improve precision?

Henrik

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/91eead45-319c-4a72-81a9-bad214a3ee61%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

I don't believe value_count is intended to be a unique count.

On Friday, March 28, 2014 7:17:47 AM UTC, Henrik Nordvik wrote:


I compared the unique count with the total field of the old terms facet and
it matched. What else would the count be? It is lower than doc count.
On 28 Mar 2014 18:54, "Mark Harwood" mark.harwood@elasticsearch.com wrote:

I don't believe value_count is intended to be a unique count.


value_count is the total number of values extracted per bucket. This
example might help:

https://gist.github.com/bly2k/9843335


Ah, so there is currently no easy way of getting exact unique counts out
of Elasticsearch?

I found a manual way of doing it:

curl -s 'http://localhost:9200/twitter-2014.03.26/_search' -d '{
  "facets": {
    "a": {
      "terms": { "field": "screen_name", "size": 200000 },
      "facet_filter": { "query": { "term": { "lang": "en" } } }
    }
  },
  "size": 0
}' | ./jq '.facets.a.terms | length'
145474 (vs 145541)

curl -s 'http://localhost:9200/twitter-2014.03.26/_search' -d '{
  "facets": {
    "a": {
      "terms": { "field": "screen_name", "size": 200000 },
      "facet_filter": { "query": { "term": { "lang": "ja" } } }
    }
  },
  "size": 0
}' | ./jq '.facets.a.terms | length'
50949 (vs 50824)

So the count is quite close! Thank you.
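
For reference, the gap between the two sets of numbers above can be put
into percentages with a few lines of Python. This is only a sketch of the
arithmetic: the figures are copied from this message, and the helper name
relative_error is made up for the illustration.

```python
def relative_error(approx, exact):
    """Signed relative error of an approximate count against an exact one."""
    return (approx - exact) / exact

# (cardinality aggregation result, terms-facet count), as reported in this thread
counts = {"en": (145541, 145474), "ja": (50824, 50949)}

for lang, (approx, exact) in counts.items():
    print(f"{lang}: {relative_error(approx, exact):+.3%}")
```

Both languages come out within a quarter of a percent of the facet-based
count.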

On Friday, March 28, 2014 10:32:55 PM UTC+1, Binh Ly wrote:

value_count is the total number of values extracted per bucket. This
example might help:

https://gist.github.com/bly2k/9843335


Hi Henrik,

Indeed, there is no way to compute exact unique counts. The reason why we
don't expose such a feature is that it would be very costly. In your case,
the cardinality is not too large, so the terms aggregation helped compute
the number of unique values. But if the actual cardinality had been very
large (e.g. 100M), using the terms agg this way would very likely have
required a lot of memory (maybe triggering out-of-memory errors on your
nodes), been very slow, and caused a lot of network traffic.
We will try to clarify this through documentation or a blog post soon.

Thanks for trying out this new aggregation!

On Mon, Mar 31, 2014 at 11:09 PM, Henrik Nordvik henrikno@gmail.com wrote:


--
Adrien Grand


Hi Adrien,

I have two comments/questions:

  1. For me, the documentation is still somewhat confusing, and the difference
    between the cardinality and value_count aggregations is not 100% clear.

  2. When it comes to counting unique values: I believe that the only option,
    at the moment, is to use the cardinality aggregation. This, however,
    comes at the price of an approximate result (as discussed in the
    documentation and in the paper describing HyperLogLog++). I understand
    the need for an approximate approach, but I think that the returned
    result should indicate a bound on the error. Otherwise, the returned
    count could be considered useless. The documentation mentions a figure
    of 5%: is it independent of the cardinality? What happens to this bound
    when the precision threshold is >> 40,000?

Thanks for your time,
Dror

On Tuesday, April 1, 2014 9:50:30 AM UTC+2, Adrien Grand wrote:


Hi Dror,

On Tue, Nov 25, 2014 at 2:29 PM, Dror Atariah drorata@gmail.com wrote:

Hi Adrien,

I have two comments/questions:

  1. For me, the documentation is still somewhat confusing, and the
    difference between the cardinality and value_count aggregations is
    not 100% clear.

I have to agree here... If you have suggestions to make it less confusing,
ideas are highly welcome (even changing the name of the aggs might be an
option if we do it in a major release).

  2. When it comes to counting unique values: I believe that the only option,
    at the moment, is to use the cardinality aggregation. This, however,
    comes at the price of an approximate result (as discussed in the
    documentation and in the paper describing HyperLogLog++). I understand
    the need for an approximate approach, but I think that the returned
    result should indicate a bound on the error. Otherwise, the returned
    count could be considered useless. The documentation mentions a figure
    of 5%: is it independent of the cardinality? What happens to this bound
    when the precision threshold is >> 40,000?

This is true: only the cardinality aggregation lets you compute unique
counts.

There is no hard bound on the error, but higher errors are less likely.
The only thing we might be able to return would be a confidence interval,
but that requires some work... The 5% mentioned in the documentation was
just meant as an example to show that, in spite of the approximate
approach, results are very close to accurate. A precision_threshold above
40000 is basically the same as a precision_threshold of 40000.
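
To illustrate why there is no hard bound while large errors stay unlikely,
here is a minimal sketch of a plain HyperLogLog estimator in Python. It is
not Elasticsearch's HyperLogLog++ implementation: the function names, the
SHA-1 based hash, and the choice of p = 12 are assumptions made for this
example, and how precision_threshold maps to a register count is not
covered. For plain HyperLogLog with m = 2^p registers, the relative
standard error is about 1.04 / sqrt(m), which depends on m rather than on
the number of distinct values.

```python
import hashlib
import math

def expected_relative_std_error(p):
    """Approximate relative standard error of plain HyperLogLog with 2**p registers."""
    return 1.04 / math.sqrt(1 << p)

def hll_estimate(values, p=12):
    """Estimate the number of distinct values with a plain HyperLogLog sketch."""
    m = 1 << p
    registers = [0] * m
    for v in values:
        # 64-bit hash taken from SHA-1 (an arbitrary choice for this sketch)
        h = int.from_bytes(hashlib.sha1(str(v).encode("utf-8")).digest()[:8], "big")
        idx = h >> (64 - p)                      # first p bits select a register
        rest = h & ((1 << (64 - p)) - 1)         # remaining 64 - p bits
        rank = (64 - p) - rest.bit_length() + 1  # position of the leftmost 1-bit
        registers[idx] = max(registers[idx], rank)
    alpha = 0.7213 / (1 + 1.079 / m)             # bias constant, valid for m >= 128
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        return m * math.log(m / zeros)           # small-range (linear counting) correction
    return raw

n = 100_000
estimate = hll_estimate((f"user{i}" for i in range(n)), p=12)
print(f"estimate={estimate:.0f}, relative error={(estimate - n) / n:+.2%}")
print(f"expected standard error={expected_relative_std_error(12):.2%}")
```

The estimate is random in the sense that it depends on how the hashes fall,
so any single run can in principle be far off; the standard error only says
that such runs become rarer the larger the deviation.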

--
Adrien Grand


Thanks for your quick reply!

On Tue, Nov 25, 2014 at 6:41 PM, Adrien Grand <
adrien.grand@elasticsearch.com> wrote:

On Tue, Nov 25, 2014 at 2:29 PM, Dror Atariah drorata@gmail.com wrote:

  1. For me, the documentation is still somewhat confusing, and the
    difference between the cardinality and value_count aggregations is
    not 100% clear.

I have to agree here... If you have suggestions to make it less confusing,
ideas are highly welcome (even changing the name of the aggs might be an
option if we do it in a major release).

Well, changing names is problematic because of backward compatibility and
should be a last resort. Before that, I'd suggest adding a section, common
to the two aggregations, with a single (minimal) example that demonstrates
the differences.
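
A minimal example of the kind suggested here might look as follows,
sketched in plain Python rather than as an Elasticsearch request. The
documents and the helper name are invented for the illustration; the point
is only that value_count totals the values extracted in each bucket, while
a unique count (what cardinality approximates) counts each distinct value
once.

```python
from collections import defaultdict

# Hypothetical documents, not taken from the dataset discussed in this thread
docs = [
    {"lang": "en", "screen_name": "alice"},
    {"lang": "en", "screen_name": "alice"},
    {"lang": "en", "screen_name": "bob"},
    {"lang": "ja", "screen_name": "carol"},
]

def per_bucket_counts(docs, bucket_field, value_field):
    """Return {bucket: (value_count, exact_distinct_count)}."""
    values = defaultdict(list)
    for doc in docs:
        values[doc[bucket_field]].append(doc[value_field])
    return {bucket: (len(v), len(set(v))) for bucket, v in values.items()}

print(per_bucket_counts(docs, "lang", "screen_name"))
# {'en': (3, 2), 'ja': (1, 1)}
```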

  2. When it comes to counting unique values: I believe that the only option,
    at the moment, is to use the cardinality aggregation. This, however,
    comes at the price of an approximate result (as discussed in the
    documentation and in the paper describing HyperLogLog++). I understand
    the need for an approximate approach, but I think that the returned
    result should indicate a bound on the error. Otherwise, the returned
    count could be considered useless. The documentation mentions a figure
    of 5%: is it independent of the cardinality? What happens to this bound
    when the precision threshold is >> 40,000?

This is true: only the cardinality aggregation lets you compute unique
counts.

There is no hard bound on the error, but higher errors are less likely.
The only thing we might be able to return would be a confidence interval,
but that requires some work... The 5% mentioned in the documentation was
just meant as an example to show that, in spite of the approximate
approach, results are very close to accurate. A precision_threshold above
40000 is basically the same as a precision_threshold of 40000.

If there's no theoretical bound, then I guess the best one can hope for is
the probability that the returned value falls outside an \epsilon interval
(what you probably refer to as a "confidence interval"). This would be
great, not to say absolutely necessary. When a data scientist presents his
work, the business (narrow-minded) guys want to know the numbers... :)

Furthermore, since ES aims for big data, it is not clear to me how one
arrives at the 40,000 figure. After all, if the number of unique values
is on the order of 1K or 100M, then the threshold cannot be the same... can
it?

Best,
Dror

PS: thanks for the interesting discussion!


--
Dror Atariah, Ph.D.
de.linkedin.com/in/atariah
