In the field of big data，many olap engine already support “accurate distinct” ,such as kylin，clickhouse。Those engine can calculate Ultra High Cardinality(billion level ) in less than a second 。Dose elasticsearch has any plan to support this feature？In my company，almost all businesses need “accurate distinct”。I think many elasticsearch's user has this requirement。
Elasticsearch aggregations work in near-real time on the data that is available in the cluster and results change as data is being indexed, updated or removed. Running accurate distinct queries efficiently across large amounts of data distributed across large number of nodes is prohibitively expensive, so you always need to make a trade-off. You can generally achieve at most two of these (fast queries, handle large amounts of data and having accurate results) and Elasticsearch often chooses to sacrifice accuracy in favour of querying large data volumes fast.
Different systems however make different trade-offs. I am no expert on OLAP tools but believe they often create a view/cube by processing a snapshot of the data at a point in time that then can be navigated and support these types of queries efficiently. Creating this OLAP cube can however take considerable time and it is only once it is created you benefit from fast queries. Once it has been created it also needs to be updated in order to not grow more and more stale. If the cube does not cover all aspects of your data you may need to create new ones for different types of queries. The data that forms part of the cube can be accurately queried, but that may not always represent the current state of the data. This is great for e.g. reporting, when you are looking at older data but may not be ideal if you want to analyze what is going on right now.
I would recommend you use the solution that best fits your needs as I do not believe you can find any system that does not make some kind of trade-off. This is something Mark Hardwood is better at explaining and he has discussed it numerous times here if I remember correctly, but I only found this thread so far.
I don't think so，many OLAP engine doesn't use cube to pre-aggregate， such as clickhouse，they are realtime。
Now ，elasticSearch use HLL to calculates an approximate count of distinct values，but the
precision_threshold maximum supported value is 40000，it‘s too small。
Even though es does not plan to support “accurate distinct”，I think the
should be bigger 。In the field of big data ,
40000 is to small。
If you have data distributed across a large number of nodes and are looking to accurately find the cardinality of a specific field that has high cardinality, information about all values basically need to be gathered and compared which means a lot of data need to be transferred especially when you want to support cardinality aggregation combined with filters. This puts strain on networks and require a lot of processing, which is why Elasticsearch uses HLL. I guess it could be possible to instead maintain a central dictionary per field that can be used to calculate cardinality, but this can get large in a large system and instead requires a lot of data to flow between nodes at index/update time instead, potentially slowing this down. It would also be hard to combine with arbitrary filters at the document level. I do not know Clickhouse, so do not know which trade-off they have made, but given physical limitations I am pretty sure some has been made.
Let's not discuss “accurate distinct”，now that elasticsearch have used HLL， why not consider adjusting the maximum value of
precision_threshold . This will satisfy users who need more precise results.
I do not know how the threshold of 40000 was selected nor how increasing this would impact on performance and memory usage, so will need to leave that to someone else. Given that cardinality aggregations can be nested under other aggregations means that even a small increase in size per HLL could result in substantially more memory being used overall.
Worth noting that result sizes lower than the precision threshold aren’t guaranteed to be 100% accurate either. The threshold marks the change in counting strategy from using hashes (which conceivably could have collisions) to HLL. Crossing the threshold is a change in probability of inaccuracy and required memory.
The whole feature and others are built on the premise that we pick FB out of the “Fast, Accurate, Big - pick 2” trade off.
What I want to say is that this parameter should be flexible for users to adjust, not limited.So that users can adjust this parameter according to their own server resources.
And what guarantee would that higher threshold bring in light of my last statement?
May be higher accuracy
So still not a solution?
Of course solved，most businesses use other solutions.But a small part of the business on elasticsearch recently raised such demands of “accurate distinct”.
One way to get around this may be to create one or more entity-centric indices so you can filter and do a count instead of use cardinality aggregations. It will take up more disk space but could be accurate and fast. The new transform api might be useful here.
Of course the transform API or any other form of heavy-lifting computation trying to increase precision will trade timeliness for accuracy so your "100% correct" answers will be "100% accurate as of [some time ago]".
This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.