Elasticsearch states that the cardinality aggregation is implemented with a non-deterministic algorithm called HyperLogLog++ [1], offering a good tradeoff between accuracy and performance. The official guide emphasizes that accuracy is excellent for low-cardinality sets. My question is: is it possible to get an exact count?
In my use case, ES holds an index with about 4M documents, but these documents can be split into smaller slices of roughly 10K documents each. A slice, and the documents belonging to it, can be identified through the slice_id field.
Every query sent to ES includes a filter stage on slice_id, which means each query involves at most 10K documents. It is worth mentioning that all documents belonging to the same slice are routed with the same key, so they all end up on the same shard. On top of this query, a set of aggregation stages is added to extract information, and some of them use the cardinality aggregation to perform a distinct count.
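Roughly, a query looks like the following sketch, written here with the Python client; the index name, routing key, and field names (my_index, my_integer_field) are simplified placeholders, not the real ones:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Restrict the search to one ~10K-document slice; the routing key is the same
# one used at index time, so the whole request is served by a single shard.
resp = es.search(
    index="my_index",                     # placeholder index name
    routing="1234",                       # routing key of the slice
    body={
        "size": 0,
        "query": {
            "bool": {
                "filter": [{"term": {"slice_id": 1234}}]
            }
        },
        "aggs": {
            "distinct_values": {
                "cardinality": {"field": "my_integer_field"}
            }
        },
    },
)

print(resp["aggregations"]["distinct_values"]["value"])
```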
The cardinality of the integer field within a slice is roughly between 2K and 4K distinct values. However, the cardinality aggregation is usually nested under an additional filter stage that reduces the cardinality to a few hundred.
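In other words, the request usually nests the cardinality under a filter aggregation, something like this (the "category" field and its value are placeholders for whatever the real filter condition is):

```python
# Same request as above, but with the cardinality aggregation nested under a
# filter aggregation that narrows the slice down to a few hundred documents.
aggs = {
    "narrowed": {
        "filter": {"term": {"category": "some_value"}},   # placeholder filter
        "aggs": {
            "distinct_values": {
                "cardinality": {"field": "my_integer_field"}
            }
        },
    }
}
```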
How will the accuracy behave in this scenario? Can we expect an exact value? If not, what can I do?
Cheers,