Hi,
I was trying out the Cardinality Aggregation. I believe it gives an
approximate count of the number of unique users.
Below is what I am using.
{
  "aggs" : {
    "user_count" : {
      "cardinality" : {
        "field" : "userid"
      }
    }
  }
}
Can someone confirm a few things for me?
What is the accuracy of the result?
The accuracy is quite good in general. We tried to give some examples in
the documentation to show that even with rather low values of the
precision threshold, the error is often very low. The paper about
HyperLogLog++ (the algorithm behind the cardinality aggregation) gives
more information about the error margin that you can expect (see figure 8
in particular).
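For reference, the precision threshold mentioned above is set with the `precision_threshold` option on the aggregation itself; below is a sketch of how that could look for the query in this thread (the value 1000 is just an illustrative example, not a recommendation):

```json
{
  "aggs" : {
    "user_count" : {
      "cardinality" : {
        "field" : "userid",
        "precision_threshold" : 1000
      }
    }
  }
}
```

Counts below the threshold are expected to be close to exact; above it, counts become increasingly approximate, at the cost of more memory for higher thresholds.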
Is this the only way, or are there other options to do this as well?
Not really. If you know the cardinality is going to be low (< 1000), you
could use a terms aggregation with a size of 0 (which tells Elasticsearch
to return all terms) and count the number of terms returned. Although this
would give you the exact number of terms, it would not scale to high
cardinalities, and the cardinality aggregation has optimizations
that make it almost exact when cardinalities are low anyway.
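A sketch of that alternative, assuming the same `userid` field as above (the aggregation name `unique_users` is just a placeholder); you would then count the buckets returned in the response on the client side:

```json
{
  "aggs" : {
    "unique_users" : {
      "terms" : {
        "field" : "userid",
        "size" : 0
      }
    }
  }
}
```

Note that this returns every distinct term as a bucket, so the response size grows with the cardinality, which is why it only suits low-cardinality fields.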
The docs say this feature is experimental; what is the future roadmap
for it, if any?
There are no concrete plans at the moment. When we added this aggregation
in Elasticsearch 1.1, it was quite new among the functionalities that
Elasticsearch exposes, so we marked it experimental in order to
have the freedom to modify it based on feedback. The experimental flag will
very likely be removed in the next major version.