I've been learning more about how the cardinality aggregation approximates unique counts since this forum helped me pinpoint that as the cause of some errors I was seeing last week.
The docs state "Please also note that even with a threshold as low as 100, the error remains under 5%, even when counting millions of items." This does not appear to be true for my sample.
For my data, the exact unique count I was expecting was 488, counted from 1,565 documents. With a precision_threshold of 100, Elasticsearch reported 520, which is off by 32, or 6.6%, on this small number of documents. Raising precision_threshold to 1000 returned the correct result of 488.
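For context, the request I'm running is essentially this (index and field names are placeholders; the shape of the aggregation is exactly what I'm using):

```
POST /my-index/_search
{
  "size": 0,
  "aggs": {
    "unique_count": {
      "cardinality": {
        "field": "my_field.keyword",
        "precision_threshold": 100
      }
    }
  }
}
```

Changing precision_threshold from 100 to 1000 in that same request is the only difference between the 520 estimate and the exact 488.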
On a larger sample, I got an estimated unique count of 19,855 against an actual unique count of 20,079 across 133,155 documents, a much better 1.1% error rate.
The docs state that the HyperLogLog++ algorithm has excellent accuracy on low-cardinality sets, but I'm seeing the opposite: the small, low-cardinality sample is the one with the larger error. Thoughts?
Related: is there any way to tell Elasticsearch "yes, I understand this will use more cluster resources, but please give me an exact count anyway" for scenarios with more than 40,000 unique values (the upper limit of the precision_threshold parameter)?
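The only workaround I've come up with so far is paging through a composite aggregation and counting the buckets client-side, roughly like the sketch below (placeholder names again, and the paging with "after" omitted), but that feels like a lot of machinery just to get one number:

```
POST /my-index/_search
{
  "size": 0,
  "aggs": {
    "unique_values": {
      "composite": {
        "size": 10000,
        "sources": [
          { "value": { "terms": { "field": "my_field.keyword" } } }
        ]
      }
    }
  }
}
```

Is there anything more direct that I'm missing?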