Getting accurate cardinality for a field in a single-shard index

Is there any way to get an exact cardinality count for a field in a single-shard index? I have thought of running a terms aggregation on the field and counting the buckets it returns, but I am worried about the memory usage in that case. Is there a way to paginate a terms aggregation efficiently in a single-shard index? Any guidance would be appreciated.

Is there a way to paginate a terms aggregation efficiently in a single-shard index?

I've proposed this approach before: https://gist.github.com/markharwood/e511bbe31e389bc04f08

//Make a note of the number in the (potentially inaccurate) response.

//Divide this by the size of bucket we can accurately predict in one request (40k) minus some allowance
//for hashing unevenness (let's say 30k)
// 1,000,000 / 30k == ~33 requests
// Now perform 33 different requests, each accounting for a different subset of the unique values:

Sorry, I am unable to follow what you are saying.

//Divide this by the size of bucket we can accurately predict in one request (40k) minus some allowance

What do you mean by the size of bucket here, and how do you predict it in one request?

40000 is the maximum number of unique things that can be counted with guaranteed accuracy in a single request.
To ensure an overall accurate result you must then issue multiple requests, where each request is guaranteed to count <=40000 unique values and covers a set of values that does not overlap with any other request.

To figure out how many requests to make, we first get the (potentially inaccurate) total number of unique values. We would divide this by 40,000, but we have to allow for the overall number being slightly inaccurate, so I chose a more conservative 30,000.
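
As a rough sketch of that arithmetic (the class and method names are made up for illustration, and rounding up is my own choice so that no request goes over the budget):

```java
final class RequestPlan {
    // Assumed per-request budget: 40,000 is the accuracy limit mentioned above,
    // and 30,000 leaves headroom for an inexact estimate and uneven hashing.
    static final long CONSERVATIVE_BUDGET = 30_000L;

    // Number of partitioned requests needed so that each one is expected
    // to see at most budgetPerRequest unique values (ceiling division).
    static long requestsNeeded(long estimatedCardinality, long budgetPerRequest) {
        if (estimatedCardinality < 0 || budgetPerRequest <= 0) {
            throw new IllegalArgumentException("estimate must be >= 0 and budget must be > 0");
        }
        long rounded = (estimatedCardinality + budgetPerRequest - 1) / budgetPerRequest;
        return Math.max(1L, rounded);
    }

    public static void main(String[] args) {
        // Estimate taken from the example in the quoted comments.
        System.out.println(requestsNeeded(1_000_000L, CONSERVATIVE_BUDGET));
    }
}
```

With the 1,000,000 estimate this gives 34 requests, one more than the ~33 quoted above, simply because it rounds up instead of down.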

We then make this number of requests and ensure that each one looks at a different group of terms from the previous requests by using the hash/modulo==N approach.
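
To illustrate why the per-request counts can simply be added up, here is a small in-memory sketch. It is not the Elasticsearch request itself: countPartition is a hypothetical stand-in for one cardinality request restricted to a single hash partition, and all names are invented for the example. It uses Math.floorMod so that the partition number always lands in 0..N-1.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class PartitionedDistinctCount {
    // Which of the numPartitions groups a value belongs to.
    static int partitionOf(String value, int numPartitions) {
        return Math.floorMod(value.hashCode(), numPartitions);
    }

    // Stand-in for one request: count unique values in a single partition only.
    static int countPartition(List<String> values, int partition, int numPartitions) {
        Set<String> seen = new HashSet<>();
        for (String v : values) {
            if (partitionOf(v, numPartitions) == partition) {
                seen.add(v);
            }
        }
        return seen.size();
    }

    // Every value falls into exactly one partition, so the partitions are
    // disjoint and the per-partition counts sum to the overall unique count.
    static int countAll(List<String> values, int numPartitions) {
        int total = 0;
        for (int p = 0; p < numPartitions; p++) {
            total += countPartition(values, p, numPartitions);
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("a", "b", "a", "c", "b", "d");
        System.out.println(countAll(values, 3)); // 4 unique values: a, b, c, d
    }
}
```

The same value always hashes to the same partition, so duplicates are never counted in two different requests.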

Thanks a lot, that makes sense now. It is a better way to calculate cardinality than a memory-hungry terms aggregation.

When I tested this, I found that .hashCode() can also return negative values. So in the hash/modulo==N approach, N can take values from -16 to +16.

Good point. Math.abs(hashcode) would make sense then.
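
One caveat, assuming the script follows plain Java semantics: Math.abs(Integer.MIN_VALUE) is still negative, so a term whose hash happens to be exactly Integer.MIN_VALUE would fall outside 0..N-1 and be missed. Math.floorMod avoids that edge case. A minimal sketch (class and method names are illustrative only):

```java
final class HashBucket {
    // Math.abs approach: fine for almost every hash, but Math.abs(Integer.MIN_VALUE)
    // overflows and stays negative, so the result can still be below zero.
    static int bucketWithAbs(int hash, int n) {
        return Math.abs(hash) % n;
    }

    // floorMod always returns a value in 0..n-1 for a positive n.
    static int bucketWithFloorMod(int hash, int n) {
        return Math.floorMod(hash, n);
    }

    public static void main(String[] args) {
        int n = 17;
        System.out.println(bucketWithAbs(Integer.MIN_VALUE, n));      // negative
        System.out.println(bucketWithFloorMod(Integer.MIN_VALUE, n)); // within 0..16
    }
}
```

Note that the two methods can assign the same negative hash to different buckets, so whichever one is chosen has to be used consistently across all requests.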

When I tried the approach described above on one of my fields, I got the following results:

Inaccurate value using cardinality aggregation: 1267903
Accurate value using terms aggregation (count of buckets): 1264203
Value obtained using this approach: 1264257

I am sure that no indexing is happening, but I do query multiple indices in the same request.

There is still a mismatch between the values. What could be the reason?

Also, my understanding of the above approach is this:

  1. All the values present for the given field are iterated by the cardinality aggregation script.
  2. We take the modulo to bucket those values, yet in every request all the values are iterated again and the modulo is recomputed, even for values already counted by earlier requests.

Am I correct in my assumption? If yes, is there a way to avoid this? I ask because the hashes are computed on the fly, and I would like to avoid the unnecessary iteration.