I've been learning more about how the cardinality aggregation approximates unique counts since this forum helped me pinpoint that as the cause of some errors I was seeing last week.
The docs state "Please also note that even with a threshold as low as 100, the error remains under 5%, even when counting millions of items." This does not appear to be true for my sample.
For my data, the exact unique count I was expecting was 488 counted from 1,565 documents. With a precision_threshold of 100, the number Elasticsearch reported was 520, so off by 32, or 6.6% on this small number of documents. Raising the precision_threshold to 1000 returned the correct result of 488.
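For reference, the kind of request I'm talking about looks roughly like this (index and field names are placeholders, using the 8.x Python client):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Placeholder index/field names; precision_threshold is the knob I'm describing.
resp = es.search(
    index="my-index",
    size=0,
    aggs={
        "unique_things": {
            "cardinality": {
                "field": "my_field.keyword",
                "precision_threshold": 100,
            }
        }
    },
)
print(resp["aggregations"]["unique_things"]["value"])
```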
On a larger sample, I got an estimated unique count of 19,855 against an actual unique count of 20,079 across 133,155 documents, a much better 1.1% error rate.
The docs state that the HyperLogLog++ algorithm has excellent accuracy on low-cardinality sets, but I'm kinda seeing the opposite. Thoughts?
Related: is there any way to tell Elasticsearch "yes, I understand this will use more cluster resources, but please give me an exact count anyway" for scenarios where you have more than 40,000 unique values (the upper limit of the precision_threshold parameter)?
First, on your small-dataset example: the larger-than-expected error is probably due to an unexpectedly long run of leading zeros in the hashes of your values. The HyperLogLog++ algorithm hashes each value, counts the number of leading zeros in the hash, and uses that count to estimate the cardinality. So if your dataset happens to produce a hash value with more leading zeros than would be expected for the actual cardinality, it can skew the results. To mitigate this, the algorithm stores many of these estimates (one per register) and averages them to get the final result, but rogue hash values can still have an effect, especially at lower precision, since there are fewer registers to level off the skew.
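To make the leading-zeros idea concrete, here is a toy Python sketch of the core HyperLogLog estimate. It is not Elasticsearch's actual implementation (HLL++ adds bias correction and a sparse encoding), just an illustration of how one unlucky hash can inflate a register and how more registers dilute that:

```python
import hashlib

def leading_zeros(value: int, bits: int) -> int:
    """Number of leading zero bits when `value` is viewed as a `bits`-wide integer."""
    return bits if value == 0 else bits - value.bit_length()

def toy_hll_estimate(values, p: int = 4) -> float:
    """Toy HyperLogLog estimate: the first p hash bits pick one of 2**p
    registers, each register keeps the longest run of leading zeros seen,
    and the registers are combined with a harmonic mean. A single value
    whose hash has an unusually long zero run inflates one register and
    skews the estimate; a higher precision (more registers) dilutes that."""
    m = 2 ** p
    registers = [0] * m
    for v in values:
        h = int.from_bytes(hashlib.sha1(str(v).encode()).digest()[:8], "big")
        idx = h >> (64 - p)                # register index from the first p bits
        rest = h & ((1 << (64 - p)) - 1)   # remaining bits supply the zero run
        rank = leading_zeros(rest, 64 - p) + 1
        registers[idx] = max(registers[idx], rank)
    # Standard bias constant (exact value varies slightly for small m).
    alpha = 0.673 if m == 16 else 0.7213 / (1 + 1.079 / m)
    return alpha * m * m / sum(2.0 ** -r for r in registers)

print(round(toy_hll_estimate(range(488))))   # rough estimate of 488 distinct values
```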
The reason increasing the precision threshold to 1000 returns the exact value is that the aggregation counts exactly up to the precision threshold and only switches to the approximate HyperLogLog++ approach beyond it.
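Conceptually the behaviour is something like this sketch (again, not the real implementation; it reuses toy_hll_estimate from the sketch above):

```python
class ToyCardinalityCounter:
    """Sketch of the exact-then-approximate behaviour: keep an exact set of
    values until it grows past the precision threshold, then fall back to an
    approximate HyperLogLog-style estimate."""

    def __init__(self, precision_threshold: int = 100):
        self.threshold = precision_threshold
        self.seen = set()

    def add(self, value):
        self.seen.add(value)

    def count(self) -> int:
        if len(self.seen) <= self.threshold:
            return len(self.seen)                 # exact below the threshold
        return round(toy_hll_estimate(self.seen)) # approximate above it
```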
The documentation is slightly misleading here in that it doesn't state that the graph shown comes from a benchmark test; the error you actually see will depend on how your values hash and how much of the skew effect mentioned above you get. I will open an issue to get this updated to be clearer.
On your larger dataset, what precision_threshold did you have set?
There is currently no option to tell Elasticsearch to give an exact count. Elasticsearch's algorithms need to be highly scalable, able to handle billions of documents/unique values without putting cluster stability at risk (in this case through OOM errors). If we supported an exact unique count operation, the memory usage would be unbounded, and we could easily run into OOM errors on large datasets, or even on moderate datasets if multiple users were executing unique count requests at the same time.
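To illustrate the memory argument with a rough, CPython-specific sketch: an exact count has to remember every distinct value it has seen, so memory grows with the cardinality, while the HLL registers are a small fixed-size array no matter how many values are added:

```python
import sys

# Exact counting: the set of seen values grows with the cardinality.
exact = {f"user-{i}" for i in range(1_000_000)}
print(f"exact set: ~{sys.getsizeof(exact) / 1024 / 1024:.1f} MiB (plus the strings themselves)")

# HyperLogLog-style counting: a fixed number of small registers,
# regardless of how many distinct values are added.
registers = bytearray(2 ** 14)   # 16,384 one-byte registers
print(f"HLL registers: ~{sys.getsizeof(registers) / 1024:.1f} KiB")
```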
On your larger dataset, what precision_threshold did you have set?
The error rate I reported in this post was with the default threshold (which is 100, correct?). I also explicitly added precision_threshold and did some testing, and as a general rule the closer I got to the actual number of unique values, the smaller the error rate; then of course once I exceeded the actual value, Elasticsearch reported the actual value accurately.
There is currently no option to tell Elasticsearch to give an exact count.
Oh, I certainly understand why the default behavior is what it is. I would advocate for the ability to override that for users who explicitly want to. I think the right strategy is to do the safe thing out of the box but also give users the ability to make potentially dangerous config decisions (and shoot themselves in the foot) if they so choose. The 40k limit on precision_threshold feels a little arbitrary and may not be appropriate for my use case.