Getting accurate cardinality for a field in single shard index

BalaKumaran · August 5, 2016, 5:02am

Is there any way to get an exact cardinality count for a field in a single-shard index? I have thought of running a terms aggregation on that field and counting the buckets it returns, but I am worried about the memory usage in that case. Is there a way to paginate a terms aggregation efficiently in a single-shard index? Please advise.

Mark_Harwood · August 5, 2016, 7:49am

Is there a way to paginate a terms aggregation efficiently in a single-shard index?

I've proposed this approach before: Accurate cardinality for large datasets (where no indexing is happening...) · GitHub

BalaKumaran · August 5, 2016, 11:29am

//Make a note of the number in the (potentially inaccurate) response.

//Divide this by the size of bucket we can accurately predict in one request (40k) minus some allowance
//for hashing unevenness (let's say 30k)
// 1,000,000 / 30k == 33~ requests
// Now perform 33 different requests to account for a subset of the unique values :

Sorry, I am unable to follow what you are saying here.

//Divide this by the size of bucket we can accurately predict in one request (40k) minus some allowance

What do you mean by the size of the bucket here, and how do you predict it in one request?

Mark_Harwood · August 5, 2016, 12:17pm

40,000 is the maximum number of unique things that can be counted with guaranteed accuracy.
To ensure an overall accurate result you must then issue multiple requests, where each request is guaranteed to be counting <=40,000 unique values and covers a set of values that no other request covers.

To figure out how many requests to make, we first get the (potentially inaccurate) total number of unique values. We would divide this by 40,000, but we have to make allowance for the overall number being slightly inaccurate, so I chose a conservative number of 30,000.

We then make this number of requests, ensuring that each one looks at a different group of terms from the previous requests by using the hash/modulo==N approach.
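The request planning and the hash/modulo==N partitioning described above can be sketched client-side in Python. This is only an illustration under assumptions: `plan_partitions`, `partition_of` and `partitioned_distinct_count` are hypothetical helper names, `zlib.crc32` stands in for whatever hash the script uses, and plain Python sets stand in for the per-request cardinality aggregation.

```python
import math
import zlib

ACCURATE_LIMIT = 40_000   # unique values one request can count accurately (from the thread)
SAFE_LIMIT = 30_000       # conservative per-request target chosen in the thread


def plan_partitions(estimated_cardinality, safe_limit=SAFE_LIMIT):
    """Number of requests needed so each covers roughly safe_limit unique values."""
    return max(1, math.ceil(estimated_cardinality / safe_limit))


def partition_of(value, num_partitions):
    """Assign a value to a partition via hash modulo N (crc32 is an assumed stand-in)."""
    return zlib.crc32(value.encode("utf-8")) % num_partitions


def partitioned_distinct_count(values, num_partitions):
    """Sum per-partition distinct counts; partitions are disjoint, so the sum is the total."""
    total = 0
    for n in range(num_partitions):
        # One pass per partition mirrors issuing one request per value of N.
        seen = {v for v in values if partition_of(v, num_partitions) == n}
        total += len(seen)
    return total


values = [f"user-{i % 5000}" for i in range(20_000)]   # 5,000 unique values, each repeated
print(plan_partitions(1_000_000))                       # 34 when rounding up
print(partitioned_distinct_count(values, 4))            # 5000
```

Rounding up gives 34 rather than the "33~" in the quoted comment; an extra request only makes each partition smaller, so it errs on the safe side.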

BalaKumaran · August 5, 2016, 12:29pm

Thanks a lot, that makes sense now. This is a better way to calculate cardinality than the memory-hungry terms aggregation.

BalaKumaran · August 9, 2016, 2:35pm

When I tested this, I found that .hashCode() can also return negative values. So in the hash/modulo==N approach, N can take values from -16 to +16.

Mark_Harwood · August 9, 2016, 3:04pm

Good point. Math.abs(hashcode) would make sense then.
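The sign issue can be reproduced outside Elasticsearch. The sketch below emulates Java's `String.hashCode()` and `%` operator in Python, purely for illustration; the helper names are hypothetical. For a modulus m, Java's remainder lies between -(m-1) and m-1. One caveat with `Math.abs`: in Java, `Math.abs(Integer.MIN_VALUE)` is still `Integer.MIN_VALUE`, so a floor-style modulo (Java has `Math.floorMod`) is a safer way to get a non-negative partition number. Whether `Math.floorMod` is available inside a given scripting sandbox is an assumption to verify.

```python
INT_MIN = -2**31


def java_string_hashcode(s):
    """Emulate Java's String.hashCode() (BMP characters only) as a signed 32-bit int."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h


def java_remainder(a, n):
    """Emulate Java's % operator: the result takes the sign of the dividend."""
    r = abs(a) % n
    return -r if a < 0 else r


def java_abs(a):
    """Emulate Math.abs(int): Integer.MIN_VALUE has no positive counterpart."""
    return a if a == INT_MIN else abs(a)


def floor_mod(a, n):
    """Non-negative modulo for n > 0, like Java's Math.floorMod."""
    return a % n   # Python's % already floors


hashes = [java_string_hashcode(f"{i}-term") for i in range(1000)]
raw = {java_remainder(h, 16) for h in hashes}
safe = {floor_mod(h, 16) for h in hashes}
print(min(raw), max(raw))      # negative partitions appear with the raw remainder
print(min(safe), max(safe))    # always within 0..15
print(java_remainder(java_abs(INT_MIN), 17))   # still negative despite abs
```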

BalaKumaran · August 29, 2016, 5:59am

When I tried the approach mentioned above on one of my fields, I got the following results:

Inaccurate value from the cardinality aggregation: 1267903
Accurate value from the terms aggregation (count of buckets): 1264203
Value from the approach above: 1264257

I am sure that no indexing is happening, although I do query multiple indexes in the same request.

There is still a mismatch between the values. What could be the reason?
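To put the three numbers side by side, here is a small Python calculation of the differences. It does not explain the mismatch; it only quantifies it. The 30,000 divisor is an assumption taken from earlier in the thread, since the post does not say how many requests were actually issued.

```python
import math

cardinality_estimate = 1_267_903   # cardinality aggregation (approximate)
terms_bucket_count = 1_264_203     # terms aggregation bucket count (treated as exact)
partitioned_count = 1_264_257      # hash/modulo approach


def error(measured, exact):
    """Return (absolute difference, relative difference in percent)."""
    diff = measured - exact
    return diff, 100.0 * diff / exact


est_diff, est_pct = error(cardinality_estimate, terms_bucket_count)
part_diff, part_pct = error(partitioned_count, terms_bucket_count)
print(f"plain estimate: off by {est_diff} ({est_pct:.2f}%)")
print(f"partitioned:    off by {part_diff} ({part_pct:.4f}%)")

# Assumed planning step: estimate divided by the conservative 30,000 target.
requests = math.ceil(cardinality_estimate / 30_000)
print(requests, round(terms_bucket_count / requests))   # requests, average uniques per request
```

The partitioned result is much closer than the plain estimate but still over by 54. One thing worth checking is the count returned by each individual request, since the approach relies on every request staying at or below 40,000 unique values.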

Also, my understanding of the above approach is as follows:

All the values present for the given field are iterated in the cardinality aggregation script.
We take the modulo to bucket those values, yet on every subsequent request all the values that were already processed are iterated again and their modulo is recomputed unnecessarily.

Am I correct in my assumption? If yes, is there a way to avoid this? I ask because the hashes are computed on the fly, and I would like to avoid the unnecessary iteration.
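The cost described here can be made concrete with a small simulation. It assumes, as the post does, that the script is evaluated for every value on every request; `count_hash_evaluations` is a hypothetical helper and `zlib.crc32` is a stand-in hash, so this is a model of the behaviour rather than a measurement of Elasticsearch itself.

```python
import zlib


def count_hash_evaluations(values, num_requests):
    """Simulate num_requests filtered passes and count how often a hash is computed."""
    evaluations = 0
    distinct_total = 0
    for n in range(num_requests):
        seen = set()
        for v in values:                       # every request walks every value again
            evaluations += 1
            if zlib.crc32(v.encode("utf-8")) % num_requests == n:
                seen.add(v)
        distinct_total += len(seen)
    return evaluations, distinct_total


values = [f"term-{i}" for i in range(10_000)]
evaluations, distinct_total = count_hash_evaluations(values, 5)
print(evaluations)      # 50000: 5 requests x 10,000 values
print(distinct_total)   # 10000
```

Under that assumption the hashing work grows with the number of requests times the number of values, while the memory needed per request stays bounded by the size of one partition.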
