Cardinality Aggregation gives wrong number?

Wassim_Ben_Amor · February 5, 2019, 12:15pm

Hello, I want to count all unique values of a field.
the Cardinality Aggregation with this request

   "aggs": {
     "1": {
       "cardinality": {
         "field": "user_id"
       }
     }
   },
   "size": 0
 }

gives : "value" : 192388
But when I do some calculation in my client:

body: {
     "size": 0,
     "aggs": {
       "uniq_users": {
         "terms": { "field": "user_id", size: 200000 }
       }
     }
   }

i calculated the length of result.aggregations.uniq_users.buckets and it gives : 192631
Is that normal ? and how to resolve it ?

dadoonet · February 5, 2019, 12:29pm

As the documentation says:

A single-value metrics aggregation that calculates an approximate count of distinct values. Values can be extracted either from specific fields in the document or generated by a script.

Have a look at Cardinality aggregation | Elasticsearch Guide [8.11] | Elastic and Cardinality aggregation | Elasticsearch Guide [8.11] | Elastic

Christian_Dahlqvist · February 5, 2019, 12:31pm

This post contains a good discussion of why some type of aggregations in Elasticsearch are approximate.

Wassim_Ben_Amor · February 5, 2019, 1:55pm

Thank you for your replies.
But I can't find a solution for this, even when I put precision_threshold to it's maximum (40 000).
So there is no solution to get the exact value ?

Mark_Harwood · February 5, 2019, 2:07pm

Maybe think about an approach where you use the cardinality agg in multiple requests, each counting what are guaranteed to be a different set of users totalling < 40,000 in number in each set then add the results of the these numbers up.

To do this use a query on the userID where you take the hashcode and modulo N where N is a number big enough to break the global population down into <40k groups. Each request would filter the docs where the result of "hash modulo N" is 0, then 1, then 2 up to N. This is how "partitioning" works in the terms aggregation.

Wassim_Ben_Amor · February 5, 2019, 2:20pm

@Mark_Harwood , Thank you for your reply, it sounds to be a good solution.
But I can't find an example of this, what do you mean by taking the hashcode and modulo N, can you give me a simple example please ?
PS : I know I can solve this using the entity centric concept and count directly the number of users, but I want to solve the problem of exact value, because I use everywhere the uniq count ..
Thank you.

Wassim_Ben_Amor · February 5, 2019, 2:41pm

Well I have found your post here :

But I can't understand, how to adapt this to my case ?

Mark_Harwood · February 5, 2019, 2:51pm

It's a standard trick [1] to break down a large set of values into evenly divided subsets. It's how we choose to route documents to a choice of shard based on the ID. In my proposal we're using the same approach to divide the large set of user IDs evenly into smaller sets totalling less than 40k each.

Maybe the better question is why does absolute accuracy matter to you? While you're running these queries several new users may have turned up and been unaccounted for?

[1] The Ultimate Guide to Consistent Hashing | Toptal®

Wassim_Ben_Amor · February 5, 2019, 3:05pm

The exact value matters because :
I have made an entity for the users containing (firstTimeConnect) => the count of documents in a period gives the number of new users in this period. After this I can calculate the returning users which depends on exact value of all the users in that period. This will serve for retention matrix later.
It was great on a normal dataset. But now I'm facing big datasets, and i'm having illogic values.
I just wanted to understand more, how can I apply the Partitionable aggregation to my case, even if it will take longer time for execution, but it's ok if It will result an exact value.
To resume : I have documents representing user requests containing the field user_id. I want to know how much uniq user I have.

Wassim_Ben_Amor · February 5, 2019, 3:30pm

Thank you everyone for your help.
I have made the correct JS code, I hope it will help others :

var users = {}
var N = 20;
var arr = Array.apply(null, { length: N }).map(Number.call, Number)
client.msearch({
  body: [].concat.apply([], arr.map((number) => ([
    { index: 'my_index' },
    {
      "size": 0,
      "aggs": {
        "my_terms": {
          "terms": {
            "field": "user_id",
            "include": {
              "partition": number,
              "num_partitions": 20
            },
            "size": 10000
          }
        }
      }
    }
  ])))
}, (err, result) => {
  if (err) {
    console.log(err)
  }
  else {
    result.responses.forEach(resp => {
      resp.aggregations.my_terms.buckets.forEach(term => {
        if (!users[term.key]) {
          users[term.key] = 1
        }
      })
    })
    console.log(Object.keys(users).length)
  }
})

I don't know if I'm choosing the right number of size & num_partitions.
While i know that the approximate number is near to 200k, so I have made 20 partitions of 10k.
Please @Mark_Harwood if this is the correct way to choose the size & num_partitions.
Thank you.

Mark_Harwood · February 5, 2019, 3:36pm

Looks about right.
My proposal was to still use the cardinality agg but feed it the results of a query with 1/Nth of the user IDs. It's the same principle as you have here but would just return the count of unique user IDs in a partition rather than what you have here which is an exhaustive list of all the actual user IDs in a partition.
In either case, remember to check that the number of users returned is at least one less than the partition size you hope for (in your example 10k) otherwise you know that N is too small and you may have overfilled that partition e.g. with 11000 users which means your overall counts could be wrong.

Wassim_Ben_Amor · February 5, 2019, 6:16pm

@Mark_Harwood Thank you again. But I am lost with all these variables.
Can you help me implement this please ?
I have a library which make hash from a string and makes it a number between 0 and 4294967295.
As the approximate number of users will be 200k users.
So my N here will be : 200k/20k(precision_threshold) = 10 ?
Can you tell me how to use the cardinality agg in that case ?
My first request will be like this :

{
  "size": 0, 
  "aggs": {
    "users": {
      "terms": {
        "field": "user_id",
        "size": 1000,
        "include": {
          "partition": 0,
          "num_partitions": 20
        }
      },
      "aggs": {
        "numberOfUsers": {
          "cardinality": {
            "field": "user_id",
            "precision_threshold": 20000
          }
        }
      }
    }
  }
}

Is it correct ?
this will gives

"buckets" : [
        {
          "key" : 2729176039,
          "doc_count" : 296,
          "numberOfUsers" : {
            "value" : 1
          }
        },
        {
          "key" : 3591711302,
          "doc_count" : 254,
          "numberOfUsers" : {
            "value" : 1
          }
        },
...
]

Thank you.

Mark_Harwood · February 5, 2019, 6:43pm

Lose the terms agg - your client is not interested in seeing individual values so you only need the cardinality agg.
The trick is to use a query that only examines docs for a single partition.
So you use a script query passing the partition number as a param. The script should return true if hash modulo N of the user ID == the passed partition number.

Wassim_Ben_Amor · February 6, 2019, 8:53am

@Mark_Harwood
Hello, Thank you for reply, but to make this fast, my hasing algorithm must have a restricted intervall, isn't it ? I mean, for example if the maximum for the hash is 4294967295, to guarantee that I have 40k groups, I will need to have 107k partitions which is not logic that I make 107k requests.
Can you just show me an example to my case ?

Wassim_Ben_Amor · February 6, 2019, 9:08am

Well, If I have understood,

{
  "size": 0,
  "query": {
    "bool": {
      "filter": {
        "script": {
          "script": {
            "source": "doc['user_id'].value % 20 == params.param1",
            "lang": "painless",
            "params": {
              "param1": 0
            }
          }
        }
      }
    }
  },
  "aggs": {
    "numberOfUsers": {
      "cardinality": {
        "field": "user_id"
      }
    }
  }
}

after this we change the param until 19.
Thank you.

Mark_Harwood · February 6, 2019, 9:20am

No, by hashing, your user IDs will be spread sparsely and evenly across the 4bn number range (unless you happen to be facebook in which case it will be dense). The equation you need is not:

4bn / numPartitions = 40k

but

maxExpectedNumUsers / numPartitions = 40k

Also, in your code example you're not taking the hashcode of the user_id value

Christian_Dahlqvist · February 6, 2019, 9:21am

You could also hash the user_id before indexing the document and store the calculated partition id in a separate field in the document. You can then simply filter on this in the query instead of using a script.

If you want you can also create a larger number of partitions initially and query for a range (which you can reduce as cardinality increases).

Wassim_Ben_Amor · February 6, 2019, 9:22am

After testing,

var N = 20;
var arr = Array.apply(null, { length: N }).map(Number.call, Number)
client.msearch({
  body: [].concat.apply([], arr.map((number) => ([
    { index: 'my-index' },
    {
      "size": 0,
      "query": {
        "bool": {
          "filter": {
            "script": {
              "script": {
                "source": "doc['user_id'].value %"+N+" == params.param1",
                "lang": "painless",
                "params": {
                  "param1": number
                }
              }
            }
          }
        }
      },
      "aggs": {
        "numberOfUsers": {
          "cardinality": {
            "field": "user_id",
            "precision_threshold": 40000
          }
        }
      }
    }
  ])))
}, (err, result) => {
  if (err) {
    console.log(err)
  }
  else {
    console.log(result.responses.map(e=> e.aggregations.numberOfUsers.value).sum())
    console.log(result.responses.map(e=> e.hits.total))
  }
})

I got the sum of the cardinalities : 192619 which is near the exact value (192631)
I got this output

As you can see, i'm sure that the all parts are under 40k, and I have specified the threshold to 40k.
Is there something wrong ?
PS : I have indexed the hash from the begin.

Mark_Harwood · February 6, 2019, 9:30am

Something fishy going on there.
You might have to compare sorted lists of user ids from the various requests to debug what's going on.
If you have a doc that has an array of user_ids not a single value that might be a source of deviation

Wassim_Ben_Amor · February 6, 2019, 9:32am

Well it's a simple field of type double (the hash as we talked),
But I will make a manual compare to debug this.
Thank you !

Topic		Replies	Views
Get number of unique values in a field Elasticsearch	3	1026	July 6, 2017
Cardinality is more than Count. How to achieve the exact uniq count? Elasticsearch	7	2179	July 5, 2017
Cardinality agg off by one even after precision increase Elasticsearch	2	406	September 30, 2021
Is the precision of cardinality aggregation decided by total unique value count or filtered unique value count? Elasticsearch	5	180	January 10, 2024
Cardinality and value_count aggr values are 200-500% off Elasticsearch	4	857	March 21, 2017

Cardinality Aggregation gives wrong number?

Related topics