Cardinality Aggregation gives wrong number?

Hello, I want to count all unique values of a field.
the Cardinality Aggregation with this request

   "aggs": {
     "1": {
       "cardinality": {
         "field": "user_id"
       }
     }
   },
   "size": 0
 }

gives : "value" : 192388
But when I do some calculation in my client:

body: {
     "size": 0,
     "aggs": {
       "uniq_users": {
         "terms": { "field": "user_id", size: 200000 }
       }
     }
   }

i calculated the length of result.aggregations.uniq_users.buckets and it gives : 192631
Is that normal ? and how to resolve it ?

As the documentation says:

A single-value metrics aggregation that calculates an approximate count of distinct values. Values can be extracted either from specific fields in the document or generated by a script.

Have a look at https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-cardinality-aggregation.html#_precision_control and https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-cardinality-aggregation.html#_counts_are_approximate

1 Like

This post contains a good discussion of why some type of aggregations in Elasticsearch are approximate.

1 Like

Thank you for your replies.
But I can't find a solution for this, even when I put precision_threshold to it's maximum (40 000).
So there is no solution to get the exact value ?

Maybe think about an approach where you use the cardinality agg in multiple requests, each counting what are guaranteed to be a different set of users totalling < 40,000 in number in each set then add the results of the these numbers up.

To do this use a query on the userID where you take the hashcode and modulo N where N is a number big enough to break the global population down into <40k groups. Each request would filter the docs where the result of "hash modulo N" is 0, then 1, then 2 up to N. This is how "partitioning" works in the terms aggregation.

1 Like

@Mark_Harwood , Thank you for your reply, it sounds to be a good solution.
But I can't find an example of this, what do you mean by taking the hashcode and modulo N, can you give me a simple example please ?
PS : I know I can solve this using the entity centric concept and count directly the number of users, but I want to solve the problem of exact value, because I use everywhere the uniq count ..
Thank you.

Well I have found your post here :


But I can't understand, how to adapt this to my case ?

It's a standard trick [1] to break down a large set of values into evenly divided subsets. It's how we choose to route documents to a choice of shard based on the ID. In my proposal we're using the same approach to divide the large set of user IDs evenly into smaller sets totalling less than 40k each.

Maybe the better question is why does absolute accuracy matter to you? While you're running these queries several new users may have turned up and been unaccounted for?

[1] https://www.toptal.com/big-data/consistent-hashing

The exact value matters because :
I have made an entity for the users containing (firstTimeConnect) => the count of documents in a period gives the number of new users in this period. After this I can calculate the returning users which depends on exact value of all the users in that period. This will serve for retention matrix later.
It was great on a normal dataset. But now I'm facing big datasets, and i'm having illogic values.
I just wanted to understand more, how can I apply the Partitionable aggregation to my case, even if it will take longer time for execution, but it's ok if It will result an exact value.
To resume : I have documents representing user requests containing the field user_id. I want to know how much uniq user I have.

Thank you everyone for your help.
I have made the correct JS code, I hope it will help others :

var users = {}
var N = 20;
var arr = Array.apply(null, { length: N }).map(Number.call, Number)
client.msearch({
  body: [].concat.apply([], arr.map((number) => ([
    { index: 'my_index' },
    {
      "size": 0,
      "aggs": {
        "my_terms": {
          "terms": {
            "field": "user_id",
            "include": {
              "partition": number,
              "num_partitions": 20
            },
            "size": 10000
          }
        }
      }
    }
  ])))
}, (err, result) => {
  if (err) {
    console.log(err)
  }
  else {
    result.responses.forEach(resp => {
      resp.aggregations.my_terms.buckets.forEach(term => {
        if (!users[term.key]) {
          users[term.key] = 1
        }
      })
    })
    console.log(Object.keys(users).length)
  }
})

I don't know if I'm choosing the right number of size & num_partitions.
While i know that the approximate number is near to 200k, so I have made 20 partitions of 10k.
Please @Mark_Harwood if this is the correct way to choose the size & num_partitions.
Thank you.

Looks about right.
My proposal was to still use the cardinality agg but feed it the results of a query with 1/Nth of the user IDs. It's the same principle as you have here but would just return the count of unique user IDs in a partition rather than what you have here which is an exhaustive list of all the actual user IDs in a partition.
In either case, remember to check that the number of users returned is at least one less than the partition size you hope for (in your example 10k) otherwise you know that N is too small and you may have overfilled that partition e.g. with 11000 users which means your overall counts could be wrong.

@Mark_Harwood Thank you again. But I am lost with all these variables.
Can you help me implement this please ?
I have a library which make hash from a string and makes it a number between 0 and 4294967295.
As the approximate number of users will be 200k users.
So my N here will be : 200k/20k(precision_threshold) = 10 ?
Can you tell me how to use the cardinality agg in that case ?
My first request will be like this :

{
  "size": 0, 
  "aggs": {
    "users": {
      "terms": {
        "field": "user_id",
        "size": 1000,
        "include": {
          "partition": 0,
          "num_partitions": 20
        }
      },
      "aggs": {
        "numberOfUsers": {
          "cardinality": {
            "field": "user_id",
            "precision_threshold": 20000
          }
        }
      }
    }
  }
}

Is it correct ?
this will gives

"buckets" : [
        {
          "key" : 2729176039,
          "doc_count" : 296,
          "numberOfUsers" : {
            "value" : 1
          }
        },
        {
          "key" : 3591711302,
          "doc_count" : 254,
          "numberOfUsers" : {
            "value" : 1
          }
        },
...
]

Thank you.

Lose the terms agg - your client is not interested in seeing individual values so you only need the cardinality agg.
The trick is to use a query that only examines docs for a single partition.
So you use a script query passing the partition number as a param. The script should return true if hash modulo N of the user ID == the passed partition number.

@Mark_Harwood
Hello, Thank you for reply, but to make this fast, my hasing algorithm must have a restricted intervall, isn't it ? I mean, for example if the maximum for the hash is 4294967295, to guarantee that I have 40k groups, I will need to have 107k partitions which is not logic that I make 107k requests.
Can you just show me an example to my case ?

Well, If I have understood,

{
  "size": 0,
  "query": {
    "bool": {
      "filter": {
        "script": {
          "script": {
            "source": "doc['user_id'].value % 20 == params.param1",
            "lang": "painless",
            "params": {
              "param1": 0
            }
          }
        }
      }
    }
  },
  "aggs": {
    "numberOfUsers": {
      "cardinality": {
        "field": "user_id"
      }
    }
  }
}

after this we change the param until 19.
Thank you.

No, by hashing, your user IDs will be spread sparsely and evenly across the 4bn number range (unless you happen to be facebook in which case it will be dense). The equation you need is not:

4bn / numPartitions = 40k

but

maxExpectedNumUsers / numPartitions = 40k

Also, in your code example you're not taking the hashcode of the user_id value

1 Like

You could also hash the user_id before indexing the document and store the calculated partition id in a separate field in the document. You can then simply filter on this in the query instead of using a script.

If you want you can also create a larger number of partitions initially and query for a range (which you can reduce as cardinality increases).

2 Likes

After testing,

var N = 20;
var arr = Array.apply(null, { length: N }).map(Number.call, Number)
client.msearch({
  body: [].concat.apply([], arr.map((number) => ([
    { index: 'my-index' },
    {
      "size": 0,
      "query": {
        "bool": {
          "filter": {
            "script": {
              "script": {
                "source": "doc['user_id'].value %"+N+" == params.param1",
                "lang": "painless",
                "params": {
                  "param1": number
                }
              }
            }
          }
        }
      },
      "aggs": {
        "numberOfUsers": {
          "cardinality": {
            "field": "user_id",
            "precision_threshold": 40000
          }
        }
      }
    }
  ])))
}, (err, result) => {
  if (err) {
    console.log(err)
  }
  else {
    console.log(result.responses.map(e=> e.aggregations.numberOfUsers.value).sum())
    console.log(result.responses.map(e=> e.hits.total))
  }
})

I got the sum of the cardinalities : 192619 which is near the exact value (192631)
I got this output

192619
[ 24061,
  26264,
  26039,
  26131,
  25965,
  25627,
  27532,
  26708,
  24823,
  25986,
  25770,
  26098,
  27994,
  25822,
  25668,
  28437,
  26850,
  25080,
  25839,
  26971 ]

As you can see, i'm sure that the all parts are under 40k, and I have specified the threshold to 40k.
Is there something wrong ?
PS : I have indexed the hash from the begin.

Something fishy going on there.
You might have to compare sorted lists of user ids from the various requests to debug what's going on.
If you have a doc that has an array of user_ids not a single value that might be a source of deviation

Well it's a simple field of type double (the hash as we talked),
But I will make a manual compare to debug this.
Thank you !