Why aren't HyperLogLog++ data structures stored in docs?

My current understanding of cardinality queries is that the best performance comes from precomputing hashes of field values and indexing them along with, or instead of, the original field value.

Aggregations then use these precomputed hashes to build a HyperLogLog++ data structure on the fly and ultimately return a cardinality result.

Although HLLs are memory efficient, they do take a non-trivial amount of time to build for a field with lots of values.

I'm wondering why we can't specify in the mapping, at index creation time, that a field's purpose is for aggregation queries only, so that Elasticsearch could simply build and store a serialized HLL++ in the document instead of just precomputed hashes for the field.

Lemme know if I've misunderstood anything.

Thanks,
Mike

I am not sure what you mean by 'serialized HLL++ in the document'.

How long is the 'non-trivial amount of time'?

HyperLogLog++ is an algorithm that operates on field values, not on documents. The first cardinality aggregation always takes longer than subsequent runs because the resources have to be allocated first.

But precomputed hashes are not enabled by default.

To iterate over precomputed hashes instead of the original values, the field needs the murmur3 field mapper. This keeps the aggregation efficient even in high-cardinality cases, since only the small murmur3 hashes need to be loaded into memory instead of the large field values.

If HyperLogLog++ is still slow, check whether tuning the precision_threshold parameter helps.
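For reference, the threshold is set per request in the aggregation body. A minimal sketch, where the aggregation name and field name are just placeholders:

{
    "aggs": {
        "distinct_count": {
            "cardinality": {
                "field": "user_id.hash",
                "precision_threshold": 1000
            }
        }
    }
}

Counts below the threshold should be close to exact; raising it trades memory for accuracy.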

To make the example a little more concrete, let's say we're storing user IDs in an ES doc:

{
  "user_ids": [
    "mike",
    "bob",
    "andrew"
  ]
}

With ES right now, I can specify that I want to precompute murmur3 hashes for the field:

{
    "user_ids": {
        "type": "string",
        "fields": {
            "hash": {
                "type": "murmur3"
            }
        }
    }
}
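
With that mapping in place, a cardinality aggregation targets the hashed sub-field (the aggregation name here is arbitrary):

{
    "aggs": {
        "distinct_user_ids": {
            "cardinality": {
                "field": "user_ids.hash"
            }
        }
    }
}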

What I'm proposing is to instead be able to specify something like this in the mapping:

{
    "user_ids": {
        "type": "string",
        "fields": {
            "hll": {
                "type": "hll++"
            }
        }
    }
}

At index time, Elasticsearch would take the user_ids field, compute the murmur3 hashes, but also build a HyperLogLog++ structure and store the serialized version as a blob in the user_ids.hll field. The blob would contain the config (e.g. precision) as well as the hash table.

During aggregation queries, the HyperLogLog++ would no longer need to be built on the fly, saving CPU on fields with a high cardinality of values.

As for "non-trivial amount of time", I haven't run benchmarks just yet but could. It'd need a change to the HLL++ implementation to allow loading the HLL from some serialized representation that stores the config (e.g. precision) as well as the hash table.

The downside of the approach I'm suggesting above, of course, is that the precision threshold is no longer configurable at aggregation time and would instead be "hardcoded" at index time.

In practice, this maybe isn't so bad, as users could specify multiple HLLs at different precision/error thresholds:

{
    "user_ids": {
        "type": "string",
        "fields": {
            "hll_low": {
                "type": "hll++",
                "error": "0.05"
            },
            "hll_medium": {
                "type": "hll++",
                "error": "0.01"
            },
            "hll_high": {
                "type": "hll++",
                "error": "0.001"
            }
        }
    }
}
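
An aggregation would then just reference whichever precomputed sketch matches the accuracy it needs, hypothetically something like:

{
    "aggs": {
        "distinct_user_ids": {
            "cardinality": {
                "field": "user_ids.hll_medium"
            }
        }
    }
}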