What is the default precision for the cardinality aggregation?
When I was running very small JUnit integration tests, I noticed that the default implementation of the cardinality aggregation had an error of 1-2 even for cardinalities below 10. As it is stated that the error is below 5%, I assumed that it should be exact for cardinalities below 20.
A similar problem arises with bigger sets when applying more and more filters. Customers might drill down into the result set, and then it looks strange if the cardinality says there are 2 different values but the detailed results show 3 (or vice versa).
Documentation claims that the cardinality aggregation has
excellent accuracy on low-cardinality sets
Of course, it is easy to set the precision threshold, but why is the error so big when using the default?
Are there any best practices for a good value for the precision?
The Cardinality agg is a bit tricky to wrangle at times. First some theoretical explanation, then I'll tackle your actual questions.
The HyperLogLog algorithm is less accurate when few elements have been added to the sketch (i.e. when the cardinality is low). Because the algorithm relies on a stochastic process -- the longest run of zeros in the hash, which for all intents and purposes is essentially random -- a few "bad" hashes at the beginning can ruin the estimation. This can cause the HLL sketch to under- or over-estimate.
This effect disappears once you have "enough" data, as the regular distribution of data will start to smooth out those early "bad" values. So the estimate will begin to converge on the true cardinality once you've warmed the sketch up.
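To make the "longest run of zeros" idea concrete, here's a minimal toy sketch of the core HLL mechanism (illustrative only -- not Elasticsearch's actual implementation, which is HLL++ with bias corrections and a proper hash function): with only a handful of elements, a couple of unlucky hashes visibly skew the estimate.

```java
import java.util.Random;

// Toy HyperLogLog: 2^p registers, each remembering the longest run of leading
// zeros seen among the hashes routed to it. Illustrative only -- real
// implementations (including the HLL++ Elasticsearch uses) add bias
// corrections and a proper hash function.
public class TinyHll {
    private final int p;            // precision: number of index bits
    private final int m;            // number of registers = 2^p
    private final byte[] registers;

    public TinyHll(int p) {
        this.p = p;
        this.m = 1 << p;
        this.registers = new byte[m];
    }

    public void add(long hash) {
        int idx = (int) (hash >>> (64 - p));                   // first p bits pick a register
        int rank = Long.numberOfLeadingZeros(hash << p) + 1;   // run of zeros in the rest, plus 1
        if (rank > registers[idx]) {
            registers[idx] = (byte) rank;                      // a single unlucky hash can inflate this
        }
    }

    public double rawEstimate() {
        double sum = 0;
        for (byte r : registers) {
            sum += Math.pow(2, -r);
        }
        double alpha = 0.7213 / (1 + 1.079 / m);               // the usual alpha approximation
        return alpha * m * m / sum;
    }

    public static void main(String[] args) {
        TinyHll hll = new TinyHll(4);                          // precision 4 = the minimum mentioned further down
        Random rnd = new Random(42);
        for (int i = 0; i < 10; i++) {
            hll.add(rnd.nextLong());                           // 10 distinct "values"
        }
        // At such low cardinality the raw estimate is usually noticeably off;
        // that's the bias described above, and the reason real implementations
        // add corrections and count exactly below a threshold.
        System.out.println("true = 10, raw HLL estimate = " + hll.rawEstimate());
    }
}
```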
To overcome this problem, there are two mechanisms. The first is a set of empirical corrections that Google added to the algorithm (which is why it's technically HLL++ instead of HLL). The second is that we use a hashmap while the cardinality is below the precision threshold, so that cardinality counts are exact until you reach that cutover point; the hashmap is then "replayed" into the HLL algorithm and it starts approximating.
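As a toy illustration of that cutover (again, not the real Elasticsearch code, just the shape of the idea): keep an exact set until the threshold is exceeded, then replay its contents into the approximate sketch.

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Toy exact-then-approximate counter: values (represented by their hashes)
// are kept in an exact set until the threshold is exceeded, then replayed
// into an HLL-style sketch. Shape of the idea only, not Elasticsearch's code.
public class HybridCounter {
    private final int threshold;
    private final int p = 14;               // 2^14 registers, matching the default precision
    private Set<Long> exact = new HashSet<>();
    private byte[] registers;               // allocated lazily at the cutover

    public HybridCounter(int threshold) {
        this.threshold = threshold;
    }

    public void add(long hash) {
        if (exact != null) {
            exact.add(hash);
            if (exact.size() > threshold) { // cutover: replay the exact set into the sketch
                registers = new byte[1 << p];
                for (long h : exact) {
                    addToSketch(h);
                }
                exact = null;               // approximate only from here on
            }
        } else {
            addToSketch(hash);
        }
    }

    private void addToSketch(long hash) {
        int idx = (int) (hash >>> (64 - p));
        int rank = Long.numberOfLeadingZeros(hash << p) + 1;
        if (rank > registers[idx]) {
            registers[idx] = (byte) rank;
        }
    }

    public long cardinality() {
        if (exact != null) {
            return exact.size();            // exact while below the threshold
        }
        double m = 1 << p, sum = 0;
        int zeros = 0;
        for (byte r : registers) {
            sum += Math.pow(2, -r);
            if (r == 0) zeros++;
        }
        double e = 0.7213 / (1 + 1.079 / m) * m * m / sum;  // raw HLL estimate
        if (e <= 2.5 * m && zeros > 0) {
            e = m * Math.log(m / (double) zeros);           // small-range (linear counting) correction
        }
        return Math.round(e);
    }

    public static void main(String[] args) {
        HybridCounter counter = new HybridCounter(3000);    // roughly the default cutover
        Random rnd = new Random(1);
        for (int i = 0; i < 10_000; i++) {
            counter.add(rnd.nextLong());
        }
        System.out.println("true = 10000, estimated = " + counter.cardinality());
    }
}
```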
In Elasticsearch, we automatically reduce the precision of the Cardinality agg if it is embedded inside other aggregations. The deeper it is embedded, the more the precision is reduced. This is done for memory reasons. While the HLL algorithm is remarkably lightweight compared to a naive hashmap solution, it's still relatively chunky (a few KB) compared to the simple counters used by aggs like count/min/max.
If you embed the Cardinality agg deeply inside several layers of terms buckets, for example, each with a branching factor of 50, you can quickly end up using GBs of memory just to hold the Cardinality aggs at the default precision.
So we scale it back to proactively protect the user. You can override this by setting the precision directly.
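For example, if you're building the request through the Java API (as in your JUnit tests), the override might look roughly like this -- the agg and field names here are placeholders:

```java
import org.elasticsearch.search.aggregations.AggregationBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;

// Hypothetical example (agg and field names are placeholders): a cardinality
// agg nested under a terms agg, with the precision threshold set explicitly
// instead of relying on the automatically reduced default.
public class PrecisionOverrideExample {
    public static SearchSourceBuilder build() {
        return new SearchSourceBuilder()
            .size(0)
            .aggregation(
                AggregationBuilders.terms("by_category").field("category").size(50)
                    .subAggregation(
                        AggregationBuilders.cardinality("unique_users")
                            .field("user_id")
                            .precisionThreshold(3000)));  // override the scaled-down default
    }
}
```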
Ok, with that all out of the way, to your specific questions:
What is the default precision for the cardinality aggregation?
The default internal precision is 14, which translates to ~3000 distinct values. This means that with the default, anything under ~3000 distinct values is counted exactly with a hashmap, and anything over that starts approximating with HLL.
However, the minimum internal precision is 4, which translates to a cardinality of 2 before it starts approximating. And the HLL sketch that is used will also be correspondingly smaller, with lower accuracy. So if your Cardinality aggs are deeply embedded, they may be getting scaled down to the very minimum precision level and only crudely approximating the true cardinality.
I'd also check and make sure you're constructing the Card agg correctly and passing in a valid precision value: the Card agg accepts a precision expressed as a number of distinct values. So you'd set 3000, which internally translates to a precision of 14 for the HLL sketch.
Sorry, this was super long, but I figured it'd make for a good resource for others to reference in the future.
You can override this by setting the precision directly.
Do you mean increasing the value of precision_threshold at the top level, such that it leads to higher precision at lower (bucket) levels compared to the default, or is there a more elegant way of doing this that simply prevents the reduction in precision within nested aggs?
Ah, sorry, should have been clearer. You'll have to set the precision on each "lower" cardinality agg individually. Their settings don't "cascade" down or anything...they are set individually.
No elegant solution, alas, as this is actively working against the defaults (which are in place to prevent memory issues). So it's an "I know what I'm doing" configuration, where you have to override it manually and accept the potential memory cost.
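So concretely, each nested cardinality agg gets its own precision_threshold, something like this (agg and field names are placeholders):

```java
import org.elasticsearch.search.aggregations.AggregationBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;

// Hypothetical example (agg and field names are placeholders): the precision
// threshold is set on every nested cardinality agg individually, since it
// does not cascade from the top.
public class NestedPrecisionExample {
    public static SearchSourceBuilder build() {
        return new SearchSourceBuilder().size(0).aggregation(
            AggregationBuilders.terms("by_country").field("country").size(50)
                .subAggregation(
                    AggregationBuilders.cardinality("unique_users_per_country")
                        .field("user_id")
                        .precisionThreshold(3000))            // set here...
                .subAggregation(
                    AggregationBuilders.terms("by_city").field("city").size(50)
                        .subAggregation(
                            AggregationBuilders.cardinality("unique_users_per_city")
                                .field("user_id")
                                .precisionThreshold(3000)))); // ...and set here again
    }
}
```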
Just a note for anyone who stumbles on this thread in the future: ES 5.0+ defaults to running the terms aggregation in breadth-first mode. This alleviates many of the memory concerns that used to plague things like cardinality (e.g. an HLL sketch being created for millions of buckets due to the branching factor).
Because of the more aggressive tree pruning that breadth-first provides, memory usage is more predictable. The cardinality aggregation now uses a static precision; it doesn't try to "adapt" its memory usage based on how deeply embedded it is anymore. So if you don't change the default precision, you get ~3000 unique values before it switches from the exact linear counting to the approximate HLL algorithm.
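For anyone on a version where breadth-first isn't yet the default, it can also be set explicitly on the terms agg -- roughly like this with the Java API (agg and field names are placeholders):

```java
import org.elasticsearch.search.aggregations.AggregationBuilders;
import org.elasticsearch.search.aggregations.Aggregator.SubAggCollectionMode;
import org.elasticsearch.search.builder.SearchSourceBuilder;

// Hypothetical example (agg and field names are placeholders): forcing
// breadth_first collection on a terms agg, which is the 5.0+ default.
public class BreadthFirstExample {
    public static SearchSourceBuilder build() {
        return new SearchSourceBuilder().size(0).aggregation(
            AggregationBuilders.terms("by_category").field("category").size(50)
                .collectMode(SubAggCollectionMode.BREADTH_FIRST)
                .subAggregation(
                    AggregationBuilders.cardinality("unique_users").field("user_id")));
    }
}
```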
Please take the same time to explain the Elasticsearch internals in as much detail again; your answer was a great help and a good read :).
I still have a couple of questions.
When you said that the hashmap algorithm is still used while the cardinality is < 3000, does that mean the switch to HLL happens at execution time, once the threshold is reached? And if the precision threshold is set to, say, 5000, does that become the new limit at which the algorithm changes?
I've also noticed that the performance of queries using the cardinality aggregation is affected by the precision_threshold value, which surprised me, but only when the cardinality agg is nested below the top level. Performance is not affected when the cardinality agg runs as a top-level aggregation.
For example, the following aggregation with a precision_threshold of 6k takes twice as long as the same query with a precision_threshold of 3k. Note that the top-level aggregation has a size of 1.
With the cardinality aggregation at the top level, the time taken is almost always the same when the precision threshold is changed, which is what I'd expect given the way the algorithm uses the precision threshold: the cost of computing the cardinality should be the same for the same input, with the precision threshold only affecting memory.
Any explanations?
I couldn't find any info about the transition between the hashmap and the HLL in the main ES docs, and for me it is crucial. I'd suggest adding it to the official guide; it's more important than an edge case :)