Are there plans to support “accurate distinct”?

Christian_Dahlqvist · May 28, 2020, 5:28pm

Elasticsearch aggregations work in near real time on the data that is available in the cluster, and results change as data is being indexed, updated or removed. Running accurate distinct queries efficiently across large amounts of data distributed across a large number of nodes is prohibitively expensive, so you always need to make a trade-off. You can generally achieve at most two of these three: fast queries, handling large amounts of data, and accurate results. Elasticsearch often chooses to sacrifice accuracy in favour of querying large data volumes fast.
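
To illustrate why an exact distinct count gets expensive when data is spread over many nodes, here is a minimal sketch in plain Python. It does not use Elasticsearch at all; the shard contents and variable names are made up for illustration. The point is that per-shard distinct counts cannot simply be added up, so an exact answer requires every shard to send its full set of values to be merged.

```python
# Hypothetical data: user IDs held by three shards.
# Some users appear on more than one shard.
shards = [
    {"alice", "bob", "carol"},
    {"bob", "dave"},
    {"alice", "dave", "erin"},
]

# Cheap but wrong: each shard reports only its local distinct count.
sum_of_local_counts = sum(len(shard) for shard in shards)

# Exact: every shard ships its full set of values so that
# duplicates can be removed globally before counting.
merged = set().union(*shards)
exact_count = len(merged)

# Number of values that had to be transferred for the exact answer.
values_shipped = sum(len(shard) for shard in shards)

print(sum_of_local_counts)  # 8 (overcounts users seen on several shards)
print(exact_count)          # 5
print(values_shipped)       # 8
```

With a handful of values this is trivial, but the amount of data that has to be moved and held in memory grows with the number of distinct values, which is the cost an approximate approach avoids.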

Different systems, however, make different trade-offs. I am no expert on OLAP tools, but I believe they often create a view/cube by processing a snapshot of the data at a point in time, which can then be navigated and supports these types of queries efficiently. Creating this OLAP cube can, however, take considerable time, and it is only once it has been created that you benefit from fast queries. Once created, it also needs to be updated so that it does not grow more and more stale. If the cube does not cover all aspects of your data, you may need to create new ones for different types of queries. The data that forms part of the cube can be accurately queried, but it may not always represent the current state of the data. This is great for e.g. reporting, when you are looking at older data, but may not be ideal if you want to analyze what is going on right now.
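
A toy sketch of that snapshot trade-off, again in plain Python and not based on any real OLAP product. The `build_cube` helper and the event data are hypothetical. The precomputed result is exact and fast to look up, but only for the data that existed when it was built.

```python
from collections import defaultdict


def build_cube(events):
    """Precompute exact distinct users per day from a snapshot of events."""
    users_per_day = defaultdict(set)
    for day, user in events:
        users_per_day[day].add(user)
    return {day: len(users) for day, users in users_per_day.items()}


# Hypothetical (day, user) events available at snapshot time.
events = [("mon", "alice"), ("mon", "bob"), ("mon", "alice"), ("tue", "carol")]

# Building the cube is the slow step; afterwards lookups are cheap and exact.
cube = build_cube(events)

# New data arrives after the snapshot was taken.
events.append(("tue", "dave"))

stale = cube["tue"]                # answer from the old snapshot
fresh = build_cube(events)["tue"]  # answer after rebuilding the cube

print(cube["mon"])  # 2
print(stale)        # 1
print(fresh)        # 2
```

The rebuild step is where the cost sits: the more often you need current answers, the more often you pay for it.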

I would recommend you use the solution that best fits your needs, as I do not believe you can find any system that does not make some kind of trade-off. This is something Mark Harwood is better at explaining, and he has discussed it numerous times here if I remember correctly, but I only found this thread so far.
