POC elastic search - correctness & exactitude of stats

It depends.
If you are doing this analysis on low-to-medium cardinality fields (those with relatively few unique values, e.g. "suppliers") then the numbers will be accurate - and we will tell you that they are accurate.

If you are doing this analysis on high-cardinality fields with millions of unique values, e.g. IP addresses, then there is some potential for inaccuracy - which we measure and report.

An example - finding the top 10 IP addresses with the highest SUM of bytes transferred might be accurate. Each data server would return its top N high-activity IP addresses (where N is greater than 10 but far less than millions, for efficiency's sake). The per-server results are summed, so we may end up with stats for 100 IP addresses, from which we take the final top 10. We can tell you if this figure is guaranteed to be accurate.
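To make the "guaranteed to be accurate" check concrete, here is a minimal Python sketch of the idea. It is not Elasticsearch's implementation: the function names (`shard_top_n`, `merge_top_k`) and the sample data are made up for illustration, and it assumes the summed values are never negative (true for byte counts). The reasoning is that anything a data server left out can be no bigger than the smallest value it did return, so that value caps how much any IP address could be missing from that server.

```python
from collections import defaultdict


def shard_top_n(shard, n):
    """Return one data server's n largest (ip, bytes) pairs, largest first."""
    return sorted(shard.items(), key=lambda kv: kv[1], reverse=True)[:n]


def merge_top_k(shards, n, k):
    """Merge per-server top-n lists; report whether the top k is provably right.

    Hypothetical helper. Assumes n >= 1 and non-negative values.
    """
    responses = [shard_top_n(shard, n) for shard in shards]
    seen = [{ip for ip, _ in response} for response in responses]

    # Sum whatever each server reported.
    totals = defaultdict(int)
    for response in responses:
        for ip, nbytes in response:
            totals[ip] += nbytes

    # If a server truncated its list, its smallest returned value is an
    # upper bound on anything it left out. Otherwise nothing is missing.
    cutoffs = [
        response[-1][1] if len(response) < len(shard) else 0
        for shard, response in zip(shards, responses)
    ]

    def upper_bound(ip):
        """Largest true total this ip could have, given what was omitted."""
        missing = sum(c for ips, c in zip(seen, cutoffs) if ip not in ips)
        return totals[ip] + missing

    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    top, rest = ranked[:k], ranked[k:]

    # Rivals: every candidate outside the top k, plus an ip no server returned.
    rivals = [upper_bound(ip) for ip, _ in rest] + [sum(cutoffs)]
    threshold = top[-1][1] if top else 0
    return top, threshold >= max(rivals)


shard_a = {"10.0.0.1": 900, "10.0.0.2": 800, "10.0.0.3": 20, "10.0.0.4": 10}
shard_b = {"10.0.0.1": 700, "10.0.0.2": 600, "10.0.0.5": 30, "10.0.0.3": 5}

top, guaranteed = merge_top_k([shard_a, shard_b], n=3, k=2)
print(top, guaranteed)
```

Here the two busiest addresses total 1600 and 1400 bytes, while nothing the servers left out could add up to more than 50, so the top 2 is guaranteed. When the gap is not that clear-cut the sketch returns `False`, which is the cue to ask each server for a larger N.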

However - the reverse of this scenario (the 10 lowest-activity IP addresses) is likely to be inaccurate. Each data server would return the N IP addresses with the least activity, and the final result might be wildly inaccurate - an IP address may have recorded a lot of activity on one data server, so that server did not return it among its N choices. That missing data would have a big impact on the final results (and again, we tell you that).
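A small Python sketch of how the "lowest N" case goes wrong. Again this is only an illustration with made-up names and data, not how Elasticsearch is implemented: one IP address is nearly idle on one data server but very busy on another, and the busy server never mentions it.

```python
from collections import defaultdict


def merged_bottom_k(shards, n, k):
    """Each server returns its n least-active ips; merge and keep the lowest k."""
    totals = defaultdict(int)
    for shard in shards:
        for ip, nbytes in sorted(shard.items(), key=lambda kv: kv[1])[:n]:
            totals[ip] += nbytes
    return sorted(totals.items(), key=lambda kv: kv[1])[:k]


def exact_bottom_k(shards, k):
    """Ground truth: sum every ip across all servers, then keep the lowest k."""
    totals = defaultdict(int)
    for shard in shards:
        for ip, nbytes in shard.items():
            totals[ip] += nbytes
    return sorted(totals.items(), key=lambda kv: kv[1])[:k]


# 10.0.0.1 is almost idle on the first server but very busy on the second.
shard_a = {"10.0.0.1": 1, "10.0.0.2": 50, "10.0.0.3": 60}
shard_b = {"10.0.0.1": 5000, "10.0.0.2": 40, "10.0.0.3": 70}

approx = merged_bottom_k([shard_a, shard_b], n=1, k=1)
exact = exact_bottom_k([shard_a, shard_b], k=1)
print(approx)  # [('10.0.0.1', 1)]
print(exact)   # [('10.0.0.2', 90)]
```

The merged answer reports 10.0.0.1 with 1 byte, when its real total is 5001 bytes and the true quietest address is 10.0.0.2. Unlike the "biggest N" case, what a server leaves out here is the large values, so there is no useful cap on how much is missing.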

Usually people are looking for "the biggest N" of something so the results are more trustworthy.

Speed, accuracy and size are a "pick 2 of 3" trade-off that people have to make, and it is a problem for all distributed systems.
