Any chance of getting zero percent error in a Cardinality Aggregation?

Elasticsearch claims to implement a non-deterministic algorithm called HyperLogLog++ [1] with a good trade-off between accuracy and performance. The official guide highlights that accuracy is excellent for low-cardinality sets. My question is: is it possible to get an exact count?

In my use case ES holds an index with about 4M documents, but these documents can be sliced into smaller chunks of roughly 10K documents per slice. A slice, and the documents belonging to it, can be identified through the slice_id field.

All queries sent to ES use a filter stage to select the slice_id, which means that each query involves at most 10K documents. It is worth mentioning that all documents belonging to the same slice are routed with the same key, so they all land on the same shard. A set of aggregation stages is then added to the query to extract information, and some of them use the cardinality aggregation to perform a distinct count.

The cardinality of the integer field within a slice is roughly 2k-4k values, but the aggregation is usually run after a filter stage that reduces the cardinality to a few hundred. A sketch of such a query is shown below.
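For reference, this is roughly what one of these queries looks like. The index, field, and filter names (my_index, slice_id, some_category, my_int_field) are placeholders for illustration; the sketch just posts the request body to the REST _search endpoint, reusing the same routing key that was used at index time:

```python
import requests

# Filter down to one ~10K-document slice (plus the extra filter that cuts the
# cardinality to a few hundred), then count distinct values of the integer field.
query = {
    "size": 0,  # we only care about the aggregation, not the hits
    "query": {
        "bool": {
            "filter": [
                {"term": {"slice_id": 42}},           # placeholder slice id
                {"term": {"some_category": "foo"}},   # placeholder extra filter
            ]
        }
    },
    "aggs": {
        "distinct_values": {
            "cardinality": {"field": "my_int_field"}
        }
    },
}

resp = requests.post(
    "http://localhost:9200/my_index/_search",
    json=query,
    params={"routing": "42"},  # same routing key as at index time, so a single shard is queried
)
print(resp.json()["aggregations"]["distinct_values"]["value"])
```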

How will the accuracy behave in this scenario? Can we expect an exact value? If not, what can I do?

Cheers,

[1] https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-cardinality-aggregation.html

I had a bit of time to check this myself. As far as I can see, the error of the cardinality aggregation for small sets of values is not negligible: I was getting errors of 5% to 10%, which is far from an exact value. Raising the precision_threshold to 1000, I was able to get an exact value for all of the cardinalities; keep in mind that we are talking about cardinalities below 2k distinct values.
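For completeness, this is how I raised the threshold; precision_threshold is the documented option of the cardinality aggregation [1], and the field name is again a placeholder:

```python
# Same cardinality aggregation as above, but with precision_threshold raised.
# According to the documentation [1], counts below this threshold are expected
# to be very close to exact, at a memory cost of about precision_threshold * 8
# bytes per bucket.
aggs = {
    "distinct_values": {
        "cardinality": {
            "field": "my_int_field",       # placeholder field name
            "precision_threshold": 1000,   # documented maximum is 40000
        }
    }
}
```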

I am wondering whether I can guarantee this accuracy with this threshold over a field that has at most 2k different integer values, and how I can prove it. The documentation only talks about the memory used: in our case 1000 * 8 bytes, which is less than 10 KB. My real question is about the implementation itself: how the hashes are mapped onto this structure, and how an integer field behaves in it. Common sense says that in the best case an integer field with 2k distinct values would only need 2000 * 4 bytes, but obviously the HyperLogLog algorithm is not ad hoc and the best case cannot be assumed, so what should I expect?
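My rough understanding of how hashes map onto the structure is sketched below. This is a toy, plain-HyperLogLog estimator, not Elasticsearch's actual HyperLogLog++ implementation (which also has a sparse mode for low cardinalities): each value is hashed, the first p bits of the hash select a register, and the register keeps the largest run of leading zeros seen in the remaining bits. With only ~2k distinct values and thousands of registers, most registers stay empty and the small-range (linear counting) correction gives a near-exact estimate, which would be consistent with the exact values I saw:

```python
import hashlib
import math

def hll_estimate(values, p=14):
    """Toy plain-HyperLogLog estimator: hash each value, use the first p bits
    of the hash to pick a register, and keep the maximum 1-based position of
    the leftmost set bit seen in the remaining 64 - p bits."""
    m = 1 << p                       # number of registers (2^p)
    registers = [0] * m
    for v in values:
        # 64-bit hash of the value; Elasticsearch hashes values too (MurmurHash3),
        # any well-mixed hash illustrates the idea
        h = int.from_bytes(hashlib.sha1(str(v).encode()).digest()[:8], "big")
        idx = h >> (64 - p)                      # first p bits -> register index
        w = h & ((1 << (64 - p)) - 1)            # remaining bits
        rank = (64 - p) - w.bit_length() + 1     # leading zeros + 1
        registers[idx] = max(registers[idx], rank)

    # raw HyperLogLog estimate
    alpha = 0.7213 / (1 + 1.079 / m)
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)

    # small-range correction: with only a few thousand distinct values most
    # registers stay at 0, and linear counting over the empty registers is
    # what makes the estimate (almost) exact in that regime
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        return m * math.log(m / zeros)
    return raw

# ~2000 distinct integers, far fewer than the 16384 registers
print(round(hll_estimate(range(2000))))   # expected to be 2000 or very close
```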

Cheers,