How to cheat unique count to not use HLL++

Marcin_Kubica · November 22, 2015, 2:58am

Hi

Is there any way to cheat unique count to get a non-approximated value?

I need this for Kibana displays. Could a scripted field do it?

Cheers
Marcin

jpountz · November 23, 2015, 3:29pm

HLL++ is the only implementation for the cardinality aggregation. You can't use anything else. May I ask why you're not happy with it?

Christian_Dahlqvist · November 23, 2015, 5:29pm

Although HLL++ is the only implementation, you can adjust the precision through the 'precision_threshold' parameter. This can be specified in Kibana as advanced JSON Input when the visualisation is created.
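For illustration, here is a minimal sketch of what that looks like, written in Python only to build and print the JSON. The field name `user_id` and the aggregation name `unique_count` are made up for the example. The value 40000 is used because it is the highest `precision_threshold` that Elasticsearch supports; counts above the threshold remain approximate.

```python
import json


def cardinality_request(field, precision_threshold=40000):
    """Build a search body with a cardinality aggregation on `field`.

    `field` is a hypothetical field name. Raising `precision_threshold`
    trades memory for accuracy; it does not make the count exact.
    """
    return {
        "size": 0,  # only the aggregation result is wanted, not the hits
        "aggs": {
            "unique_count": {
                "cardinality": {
                    "field": field,
                    "precision_threshold": precision_threshold,
                }
            }
        },
    }


# In Kibana, the JSON Input box of the Unique Count metric only needs
# the extra parameter itself; Kibana merges it into the aggregation.
kibana_json_input = {"precision_threshold": 40000}

body = cardinality_request("user_id")
print(json.dumps(body, indent=2))
print(json.dumps(kibana_json_input))
```

The first printed document is the body you would send to the `_search` endpoint directly; the second is what would go into Kibana's JSON Input field.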

Marcin_Kubica · December 1, 2015, 9:32am

@jpountz I'm happy with HLL++; however, there are business cases where you simply need a real distinct count.

An example is when you are paid per unique count and approximation of any sort is not an option, unless you are happy to receive, e.g., up to 5% less for your service (approximated), which can leave you not earning any cash at all.

Is there really no option with E(L)K to create a script that would produce a distinct count metric? I would think it could be a matter of, e.g., creating a new index, stuffing it with the results of the comparison, then tallying the amount and reporting it back as a metric. With a cluster behind it, it should still be way quicker than, e.g., a single-threaded count in the mongodb shell.

@Christian_Dahlqvist Correct, but you still have to accept the error margin, which again might work for most cases, but not all.

Christian_Dahlqvist · December 1, 2015, 9:44am

It is correct that there will be an error margin, although smaller. I believe that you can do this in Elasticsearch using a scripted metric aggregation, but that can result in a lot of data needing to be transferred between nodes and will therefore not scale well. It is also unfortunately not currently supported by Kibana.
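To make the scaling concern concrete, here is a rough Python simulation of the idea behind an exact distinct count. It is not Elasticsearch code and not a real scripted metric: the shard contents, the `user_id` field, and the function names are all invented for the sketch. Each shard collects the full set of values it has seen, and the coordinating node has to receive and merge every one of those sets.

```python
def map_phase(shard_docs, field):
    """Per-shard step: collect every distinct value of `field` on this shard."""
    seen = set()
    for doc in shard_docs:
        if field in doc:
            seen.add(doc[field])
    return seen


def reduce_phase(shard_sets):
    """Coordinating-node step: merge all shard sets and count the union."""
    merged = set()
    for shard_set in shard_sets:
        merged |= shard_set
    return len(merged)


# Hypothetical documents spread over three shards.
shards = [
    [{"user_id": "a"}, {"user_id": "b"}, {"user_id": "a"}],
    [{"user_id": "b"}, {"user_id": "c"}],
    [{"user_id": "d"}],
]

shard_sets = [map_phase(docs, "user_id") for docs in shards]
exact_count = reduce_phase(shard_sets)

# Every distinct value on every shard has to cross the network.
values_transferred = sum(len(s) for s in shard_sets)

print(exact_count)         # 4
print(values_transferred)  # 5
```

The result is exact, but the amount of data shipped between nodes grows with the number of distinct values per shard, whereas an HLL++ sketch stays at a bounded size regardless of cardinality.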

Marcin_Kubica · December 1, 2015, 10:01am

Cheers mate.