Are there plans to support “accurate distinct”?

Elasticsearch aggregations work in near real time on the data that is available in the cluster, and results change as data is being indexed, updated or removed. Running accurate distinct queries efficiently across large amounts of data distributed across a large number of nodes is prohibitively expensive, so you always need to make a trade-off. You can generally achieve at most two of these three: fast queries, handling large amounts of data, and accurate results. Elasticsearch often chooses to sacrifice accuracy in favour of querying large data volumes fast.
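To make that trade-off concrete: the `cardinality` aggregation returns an approximate distinct count, and its `precision_threshold` setting lets you spend more memory for better accuracy at lower cardinalities. Below is a minimal sketch in Python that only builds the search body. The field name `user_id` and the helper `distinct_count_request` are made up for illustration, and the 40000 cap reflects the documented maximum as I understand it.

```python
import json

# Highest precision_threshold Elasticsearch supports, per the docs
# (assumption: larger values are treated as this maximum).
MAX_PRECISION_THRESHOLD = 40000


def distinct_count_request(field, precision_threshold=3000):
    """Build a search body asking for an approximate distinct count.

    `field` is a hypothetical field name; the counts returned by the
    cardinality aggregation are approximate, not exact.
    """
    threshold = min(precision_threshold, MAX_PRECISION_THRESHOLD)
    return {
        "size": 0,  # no hits needed, only the aggregation result
        "aggs": {
            "distinct_values": {
                "cardinality": {
                    "field": field,
                    "precision_threshold": threshold,
                }
            }
        },
    }


body = distinct_count_request("user_id", precision_threshold=50000)
print(json.dumps(body, indent=2))
```

You would send this body to the `_search` endpoint of your index. Below the threshold the counts are expected to be close to accurate; above it the result becomes fuzzier, which is the accuracy being traded for speed and scale.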

Different systems, however, make different trade-offs. I am no expert on OLAP tools, but I believe they often create a view/cube by processing a snapshot of the data at a point in time, which can then be navigated and supports these types of queries efficiently. Creating this OLAP cube can, however, take considerable time, and you only benefit from fast queries once it has been created. After that it also needs to be updated so that it does not grow more and more stale. If the cube does not cover all aspects of your data, you may need to create new ones for different types of queries. The data that forms part of the cube can be accurately queried, but it may not always represent the current state of the data. This is great for e.g. reporting, when you are looking at older data, but may not be ideal if you want to analyze what is going on right now.

I would recommend you use the solution that best fits your needs, as I do not believe you can find any system that does not make some kind of trade-off. This is something Mark Harwood is better at explaining, and he has discussed it numerous times here if I remember correctly, but I only found this thread so far.
