Cardinality Aggregation gives wrong number?

@Mark_Harwood
I'm still blocked.
Can you tell me what exactly the precision_threshold is? I can't find the technical details of it in the docs.
For example, if it is set to 40k, does that mean the cardinality of a set containing fewer than 40k values will be exact?

Yes - you can think of it as the threshold at which it switches from stuffing values into a hashmap (which is an accurate way to count) to using a probabilistic data structure, which is less accurate but more scalable.
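For reference, this is roughly how the setting is applied - a minimal sketch using the Node client (the index and field names here are only placeholders):

// Minimal sketch (index/field names are placeholders); assumes the legacy
// `elasticsearch` Node client against a local cluster.
const elasticsearch = require('elasticsearch');
const client = new elasticsearch.Client({ host: 'localhost:9200' });

client.search({
  index: 'my-index',
  body: {
    size: 0,
    aggs: {
      numberOfUsers: {
        cardinality: {
          field: 'user_id',            // distinct count of this field
          precision_threshold: 40000   // 40000 is the documented maximum
        }
      }
    }
  }
}, (err, result) => {
  if (err) return console.error(err);
  console.log(result.aggregations.numberOfUsers.value); // approximate distinct count of user_id
});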

So, to summarize all this.
There is no way to get an exact cardinality value.

Ah no - I think I see what you've done. You've got hash collisions.

The hash of the userID shouldn't be held in a field called userID - it should be in a field called userIDHash (for that is what it is).
You use userIDHash to group results into smaller work partitions for processing (using that script query), but you always do your cardinality agg on the userID field.
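Something like this, for illustration (a sketch only - the userID / userIDHash names are placeholders, and the client setup is assumed):

// Sketch of the idea (field names are placeholders): filter into one work
// partition using the hash field, but run the cardinality agg on the raw ID.
const elasticsearch = require('elasticsearch');
const client = new elasticsearch.Client({ host: 'localhost:9200' });

const N = 20;         // total number of partitions
const partition = 0;  // which partition this request covers (0 .. N-1)

client.search({
  index: 'my-index',
  body: {
    size: 0,
    query: {
      bool: {
        filter: {
          script: {
            script: {
              source: "doc['userIDHash'].value % params.n == params.p",
              lang: 'painless',
              params: { n: N, p: partition }
            }
          }
        }
      }
    },
    aggs: {
      numberOfUsers: {
        // note: the field here is the original userID, not its hash
        cardinality: { field: 'userID', precision_threshold: 40000 }
      }
    }
  }
}, (err, result) => {
  if (err) return console.error(err);
  console.log(result.aggregations.numberOfUsers.value);
});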

Well, I assume that my hashing algorithm (a 32-bit hash from farmhash, a Node library) has produced some collisions.
So for testing I have deleted the user IDs that collide (6 user IDs).
That means I should find exactly 192 625 unique user IDs.
I have added the user_id_hash field even though I'm not convinced it matters, because calculating the cardinality on the hash or on the string value should be the same - both are supposed to be unique.
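(As a quick sanity check on those collisions: with a 32-bit hash over roughly 190k distinct IDs, the birthday approximation predicts a handful of colliding pairs, so finding a few was to be expected.)

// Back-of-the-envelope estimate of expected 32-bit hash collisions
// (birthday approximation: n * (n - 1) / (2 * 2^32)).
const n = 192631;                              // ~192 625 surviving IDs + the 6 removed ones
const expected = (n * (n - 1)) / (2 * 2 ** 32);
console.log(expected.toFixed(1));              // ≈ 4.3 expected colliding pairs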

So, as always, I have run my script:

// Client setup was not shown in the original post; this assumes the legacy
// `elasticsearch` Node client against a local cluster.
const elasticsearch = require('elasticsearch');
const client = new elasticsearch.Client({ host: 'localhost:9200' });

// Split the work into N partitions based on the user ID hash.
var N = 20;
var arr = Array.from({ length: N }, (_, i) => i); // [0, 1, ..., N - 1]
console.log(arr)

client.msearch({
  // One search per partition: filter the docs whose user_id_hash falls into
  // the partition, then run a cardinality agg on the real user ID field.
  body: [].concat.apply([], arr.map((number) => ([
    { index: 'my-index' },
    {
      "size": 0,
      "query": {
        "bool": {
          "filter": {
            "script": {
              "script": {
                "source": "doc['user_id_hash'].value % " + N + " == params.param1",
                "lang": "painless",
                "params": {
                  "param1": number
                }
              }
            }
          }
        }
      },
      "aggs": {
        "numberOfUsers": {
          "cardinality": {
            "field": "final_user_id",
            "precision_threshold": 40000
          }
        }
      }
    }
  ])))
}, (err, result) => {
  if (err) {
    console.log(err)
  }
  else {
    const cardinalities = result.responses.map(e => e.aggregations.numberOfUsers.value);
    console.log("TOTAL CARDINALITY")
    console.log(cardinalities.reduce((a, b) => a + b, 0)) // plain arrays have no .sum()
    console.log("CARDINALITIES PER PARTITION")
    console.log(cardinalities)
    console.log("DOC COUNT PER PARTITION")
    console.log(result.responses.map(e => e.hits.total))
  }
})

which gives the wrong cardinality:

TOTAL CARDINALITY
192618
CARDINALITIES PER PARTITION
[ 9569,
  9568,
  9563,
  9711,
  9436,
  9485,
  9560,
  9808,
  9703,
  9517,
  9732,
  9678,
  9594,
  9718,
  9721,
  9602,
  9670,
  9542,
  9816,
  9625 ]
DOC COUNT PER PARTITION
[ 25507,
  25941,
  26224,
  24981,
  25582,
  25108,
  26235,
  25508,
  26982,
  26750,
  26999,
  26178,
  25190,
  25702,
  26841,
  25250,
  26357,
  26333,
  27842,
  28144 ]

I'm really not convinced by precision_threshold; it feels like a random calculation.
Even when I simulate the partitioning and calculate the cardinalities manually, I can see that the cardinality calculated in each partition is different from the result of the ES calculation.
Code:

// Requires were not shown in the original post; `hash` is assumed to be the
// farmhash module and users-original.json the raw list of user IDs.
const fs = require('fs');
const hash = require('farmhash');

let usersOriginal = JSON.parse(fs.readFileSync('./users-original.json'))
let usersHash = usersOriginal.map(e => hash.hash32(e))
console.log("CORRECT CARDINALITY :")
console.log([...new Set(usersHash)].length)
console.log(usersHash.length)

var N = 20;
var arr = Array.from({ length: N }, (_, i) => i);
console.log("CORRECT CARDINALITIES :")
console.log(arr.map(number => {
  // count the hashes that fall into each partition (all IDs are unique here)
  return usersHash.filter(e => e % N == number).length
}))

which gives this output:

CORRECT CARDINALITY :
192625
192625
CORRECT CARDINALITIES :
[ 9570,
  9568,
  9563,
  9711,
  9435,
  9484,
  9560,
  9808,
  9703,
  9517,
  9735,
  9678,
  9594,
  9719,
  9723,
  9603,
  9671,
  9542,
  9815,
  9626 ]

I really can't understand ...

How many shards are you using? Does it fail with only 1 shard?

I am using one shard:

"number_of_shards" : "1",
"number_of_replicas" : "0",

OK - if you can identify an example of <40k IDs that are counted incorrectly then it sounds like we can open a bug for that. What version of elasticsearch?

I misspoke. The precision threshold is a distinction between using method 1 and method 2 for counting, but method 1 still does not guarantee absolute accuracy - it can still be subject to hash collisions.
Apologies for the misleading advice.

So, how exactly should we understand the threshold? Isn't it supposed to give an exact value (when set to 40k) as long as the count stays under 40k?

No, I assumed wrong. Under the precision threshold (which is 40k at most) it uses one counting technique, and if the number of values collected grows beyond that it switches to a cheaper but less accurate technique. Even the first technique has a small margin of error.

For a fully accurate count of large numbers of IDs you can use either the composite aggregation with the after parameter, or the slightly messier partitioned terms aggregation (similar in spirit to the partitioning you did here). In either case your client has to page through to completion and count the matches.
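Paging the composite aggregation looks roughly like this - a sketch only, reusing the index and field names from your earlier scripts:

// Sketch: count distinct user IDs exactly by paging a composite aggregation
// and counting buckets client-side. Assumes the legacy `elasticsearch` Node
// client and the index/field names used earlier in this thread.
const elasticsearch = require('elasticsearch');
const client = new elasticsearch.Client({ host: 'localhost:9200' });

async function countDistinctUsers() {
  let total = 0;
  let after; // after_key from the previous page; undefined for the first page

  while (true) {
    const resp = await client.search({
      index: 'my-index',
      body: {
        size: 0,
        aggs: {
          users: {
            composite: {
              size: 10000, // buckets per page, tune as needed
              sources: [{ user: { terms: { field: 'final_user_id' } } }],
              ...(after ? { after } : {})
            }
          }
        }
      }
    });

    const agg = resp.aggregations.users;
    total += agg.buckets.length; // each bucket is one distinct user ID
    if (agg.buckets.length === 0) break;
    // On versions without after_key, use the key of the last bucket instead.
    after = agg.after_key || agg.buckets[agg.buckets.length - 1].key;
  }
  return total;
}

countDistinctUsers()
  .then(n => console.log('EXACT CARDINALITY', n))
  .catch(console.error);

The partitioned terms route is the same idea from the client side: page through all the buckets across the partitions and count them yourself.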

I appreciate the confusion given the array of choices available. It's impossible to satisfy all requirements around speed, accuracy and scale, so there are a number of possible techniques, each with different trade-offs. I've been sketching out ideas for a blog post on the topic to help people navigate the options, centred around a decision tree something like this:


I'd appreciate any feedback on whether this sort of thing would be useful.

Thank you for your reply.
Just one last remark.
I have tried the cardinality aggregation on a small dataset (18k documents containing the userId field), and it is also an approximation there. Honestly, I am disappointed that Elasticsearch cannot produce an exact analytics result even at low data volumes. At least now I'm sure Elastic is not made for analytics; I need a better big data solution.
Thank you for your time.

Everything elasticsearch does is designed for scale. We don't want to build data analysis functions that blow up when users provide us with a lot of data.

The world of big data is, by necessity, built on fuzzier constructs [1]. We're using the same algorithms used by the other big data platforms for exactly the same reasons. As I outlined here, it's a necessary trade-off in the age of big data. You can't expect to beat physics.

Arguably we could offer a function that guarantees cardinality accuracy at small scale, but small scale is not our mission.

[1] Introduction to Probabilistic Data Structures - DZone
