Hi,
I'm trying out the new cardinality aggregation and want to measure its
accuracy on my data. I'm using a dataset of one day of sample tweets (2.8M
tweets).
I'm counting the number of unique usernames per language.
To get my "reference" unique count I use this:
GET /twitter-2014.03.26/_search
{
  "size": 0,
  "aggs": {
    "country_count": {
      "terms": {
        "field": "lang"
      },
      "aggs": {
        "unique_count": { "value_count": { "field": "screen_name" } }
      }
    }
  }
}
Result:
"aggregations": {
  "country_count": {
    "buckets": [
      {
        "key": "en",
        "doc_count": 872906,
        "unique_count": {
          "value": 307489
        }
      },
      {
        "key": "ja",
        "doc_count": 581521,
        "unique_count": {
          "value": 103035
        }
      },
To get the approximate count with cardinality:
GET /twitter-2014.03.26/_search
{
  "size": 0,
  "aggs": {
    "country_count": {
      "terms": {
        "field": "lang"
      },
      "aggregations": {
        "distinct_users_approx": {
          "cardinality": {
            "field": "screen_name",
            "precision_threshold": 40000
          }
        }
      }
    }
  }
}
Result:
"aggregations": {
  "country_count": {
    "buckets": [
      {
        "key": "en",
        "doc_count": 872906,
        "distinct_users_approx": {
          "value": 145541
        }
      },
      {
        "key": "ja",
        "doc_count": 581521,
        "distinct_users_approx": {
          "value": 50824
        }
      },
So, 307489 vs 145541 for English, and 103035 vs 50824 for Japanese. The
approximate counts are roughly half the reference values, which is not
very accurate.
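To put a number on the gap, here is a quick sketch in Python that computes the relative difference between the two sets of counts. The helper name relative_error is my own and is not part of Elasticsearch; the numbers are copied from the two result snippets above, and the calculation assumes the value_count figures really are a valid reference.

```python
# Hypothetical helper (not an Elasticsearch API): relative error of an
# approximate count against a reference count.
def relative_error(reference: int, approx: int) -> float:
    return abs(reference - approx) / reference


# (reference, approximate) pairs copied from the results above.
counts = {
    "en": (307489, 145541),
    "ja": (103035, 50824),
}

for lang, (reference, approx) in counts.items():
    print(f"{lang}: {relative_error(reference, approx):.1%} off")
```

This prints a difference of about 52.7% for "en" and 50.7% for "ja".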
- Am I computing the reference unique count correctly?
- Is cardinality supposed to be this inaccurate on this type of dataset?
- Is there any way to improve precision?
Henrik