POC Elasticsearch - correctness & exactitude of stats

Hi,

We're about to start a POC with Elasticsearch.
Before posting here, I read many blogs, answers and topics saying that ES is fast but can return incorrect or approximate results.

What we need to be able to achieve is complex aggregations and "stats" in general on a fairly big amount of data.
We of course need averages, min, max, counts, group by, etc. to be precise and exact. We can't afford to miss even a few records.

Is this going to be a problem with ES?

Thanks a lot for your help
Regards

Hi

Where did you read that ES "can return incorrect or approximate results"?

As far as I know, the only place where results are approximate is in aggregations, specifically the cardinality aggregation.

bye,
Xavier

If stats results are accurate and no records are missed out, ever, then what do they mean in this forum post?

Thanks a lot for helping out and clarifying the points where I find the blogs and other replies slightly unclear.
Thanks!

It depends.
If you are doing this analysis on low-to-middle cardinality fields (those with relatively few unique values e.g. "suppliers") then numbers will be accurate - and we will tell you that they are accurate.

If you are doing this analysis on high-cardinality fields with millions of unique values e.g. IP address then we have some potential for inaccuracies - which we measure and report.
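For readers who want to see where that reporting shows up, here is a minimal sketch of a terms aggregation request body and a helper that reads the reported error. The parameters `shard_size` and `show_term_doc_count_error` and the response fields `doc_count_error_upper_bound` and `sum_other_doc_count` are documented for the terms aggregation; the field name `client_ip`, the aggregation name `top_ips`, the helper and the sample response values are made up for illustration.

```python
# Request body for a terms aggregation on a hypothetical "client_ip" field.
request_body = {
    "size": 0,  # no hits, aggregation results only
    "aggs": {
        "top_ips": {  # hypothetical aggregation name
            "terms": {
                "field": "client_ip",               # hypothetical field
                "size": 10,                         # final number of buckets
                "shard_size": 100,                  # candidates requested per shard
                "show_term_doc_count_error": True,  # per-bucket error bounds
            }
        }
    },
}


def describe_accuracy(agg_result):
    """Summarise the error bound reported for one terms aggregation result."""
    bound = agg_result["doc_count_error_upper_bound"]
    if bound == 0:
        return "no error reported"
    return f"counts may be off by up to {bound}"


# Hand-written sample in the shape of a terms aggregation result (values invented).
sample_result = {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [{"key": "10.0.0.1", "doc_count": 42}],
}

print(describe_accuracy(sample_result))
```

The body would be sent with whatever client you use against your own index; only the dictionary shape matters here.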

An example - finding the top 10 IP addresses with the highest SUM of bytes transferred might be accurate. Each data server would return its top N high-activity IP addresses (where N is greater than 10 but less than millions for efficiency's sake). The final results are summed and we may end up with stats for 100 IP addresses and take the final top 10. We can tell you if this figure is guaranteed to be accurate.

However - the reverse of this scenario (the 10 lowest-activity IP addresses) is likely to be inaccurate. Each data server would return the N IP addresses with the least amount of activity and the final result might be wildly inaccurate - an IP address may have recorded a lot of activity on one data server, so that server never returned it among its N lowest-activity choices. That missing data would have a big impact on final results (and again, we tell you that).

Usually people are looking for "the biggest N" of something so the results are more trustworthy.
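The contrast between the biggest-N and lowest-N cases can be reproduced with a toy simulation. This is only a sketch of the idea described in this reply, not how Elasticsearch is implemented: the three "shards" are plain dictionaries, the byte counts are invented, and all function names are hypothetical.

```python
from collections import Counter


def shard_candidates(shard, n, largest=True):
    """Each shard returns only its own n largest (or smallest) IPs by bytes."""
    ranked = sorted(shard.items(), key=lambda kv: kv[1], reverse=largest)
    return dict(ranked[:n])


def merged_ranking(shards, n, k, largest=True):
    """Sum the partial results from every shard and keep the final k."""
    totals = Counter()
    for shard in shards:
        totals.update(shard_candidates(shard, n, largest))
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=largest)
    return ranked[:k]


def exact_ranking(shards, k, largest=True):
    """Ground truth: sum every IP on every shard before ranking."""
    totals = Counter()
    for shard in shards:
        totals.update(shard)
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=largest)
    return ranked[:k]


# Invented bytes-per-IP figures on three data servers.
shards = [
    {"10.0.0.1": 900, "10.0.0.2": 5, "10.0.0.3": 40},
    {"10.0.0.1": 800, "10.0.0.2": 700, "10.0.0.3": 30},
    {"10.0.0.1": 850, "10.0.0.2": 6, "10.0.0.3": 50},
]

print("top, merged:   ", merged_ranking(shards, 1, 1, largest=True))
print("top, exact:    ", exact_ranking(shards, 1, largest=True))
print("bottom, merged:", merged_ranking(shards, 1, 1, largest=False))
print("bottom, exact: ", exact_ranking(shards, 1, largest=False))
```

In this data the merged top result matches the exact one, while the merged bottom result picks 10.0.0.2 with 11 bytes even though that address really transferred 711 bytes, because the shard holding its 700 bytes never reported it as a low-activity candidate.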

Speed, accuracy and size are a "pick 2 of 3" trade-off that people have to make, which is a problem for all distributed systems.

Ok great, good to know.

What about operations, again on IP addresses for example, where I want to group by 192.168.100.* vs 192.168.0.*?
Am I going to get inaccurate results?
Or what if I wanted to get all records for the IP address 192.168.100.66?

Trying to wrap my head around accuracy in these filtering conditions.

Thanks!

Think of data servers like individual people. They each hold a bunch of data in their heads and you get to ask them questions and they reply with samples of what they hold in their heads.

If you ask 10 people for their favourite primary colour you can determine the overall number one result for the group fully accurately.
However, if you ask 10 people for their favourite music album you may not determine the overall number one result of the group accurately. Because there are so many albums to choose from, each person may have replied with an entirely different album choice, giving you a list of 10 different albums all liked by only one person. To get the true answer you'd have to ask each person for more than one album choice, say their top 20, and then you'd get the right answer.
20 is probably a good number.
A million would be overkill - theoretically it could be necessary but highly unlikely in most data distributions. We pick these "over-counting" numbers for you based on the number of data servers you query and the number of final results you want back. We can also figure out how inaccurate we are by returning the score of the 21st value that didn't make the cut.

So accuracy depends on the number of unique values (primary colours or music albums?) that you are trying to discover through ranking.
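The "ask each person for more than one choice" idea can be sketched in a few lines. This is a simplified model of the analogy, with invented scores and hypothetical names; the error bound here is taken from the first value each person did not return, as described above, which is not necessarily how Elasticsearch computes its own figure.

```python
from collections import Counter


def ask_everyone(people, size, shard_size):
    """Merge each person's top `shard_size` answers and keep the final `size`.

    Also returns an error bound: the sum, over all people, of the score of
    the first answer that did not make that person's cut.
    """
    totals = Counter()
    error_bound = 0
    for votes in people:
        ranked = sorted(votes.items(), key=lambda kv: kv[1], reverse=True)
        totals.update(dict(ranked[:shard_size]))
        if len(ranked) > shard_size:
            error_bound += ranked[shard_size][1]
    return totals.most_common(size), error_bound


# Three people, each with a different favourite (x, y, z) and a shared
# second choice "c" that is the true overall winner.
people = [
    {"x": 6, "c": 4, "y": 1},
    {"y": 5, "c": 4, "z": 1},
    {"z": 5, "c": 4, "x": 1},
]

for shard_size in (1, 2, 3):
    print(shard_size, ask_everyone(people, size=1, shard_size=shard_size))
```

Asking for one answer each gives "x" as the winner with an error bound of 12, which signals the result cannot be trusted. Asking for two answers each gives "c" with 12 points and an error bound of 3, small enough that no missing value could overtake it.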

Ok I see,

so for users logging in from the same ip address, in the case below, I may not be able to distinguish them all and count that 4 users logged in from 192.168.1.1 and 3 only from 192.168.1.2 (imagining we're talking about thousands of records of course).

192.168.1.1 neo
192.168.1.1 trin
192.168.1.1 morf
192.168.1.1 doz

192.168.1.2 luke
192.168.1.2 anakin
192.168.1.2 yoda

What do you think?

It depends.

I assume in this question you're trying to find the top N IP addresses based on the number of unique users seen to be logging in from that IP.

That potential for inaccuracy depends on how the data is distributed.

In the best-case scenario for accuracy, you have all the data in a single shard and the results are fully accurate.

Let's paint the worst-case scenario:

  1. You spread the data across many shards (to help with parallelised reads/writes)
  2. You use multiple time-based indices (e.g. one new index per day, retain last 60 days' indices)
  3. You have billions of IP addresses
  4. The most-shared IP addresses only have 2 users logging in from them

That's a bad case because the two shared logins may be on different data servers so to each data server every single login is a potential candidate of interest and all servers would have to return all docs to a central point for full accuracy (this doesn't scale).
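A toy version of this worst case, as a sketch only: two "data servers" each hold ten single-login IP addresses plus one half of a shared IP (10.9.9.9). All addresses, user names and the function name are invented, and each login event is assumed to be a distinct user, so counting events per IP equals counting unique users.

```python
from collections import Counter


def top_ip(shards, shard_size):
    """Merge each shard's top `shard_size` IPs by local login count."""
    merged = Counter()
    for events in shards:
        local = Counter(ip for ip, _user in events)
        # Ties are broken by IP string so the example is deterministic.
        ranked = sorted(local.items(), key=lambda kv: (-kv[1], kv[0]))
        merged.update(dict(ranked[:shard_size]))
    return max(merged.items(), key=lambda kv: (kv[1], kv[0]))


shard0 = [(f"10.0.0.{i}", f"user0_{i}") for i in range(10)] + [("10.9.9.9", "neo")]
shard1 = [(f"10.0.1.{i}", f"user1_{i}") for i in range(10)] + [("10.9.9.9", "trin")]
shards = [shard0, shard1]

print(top_ip(shards, shard_size=5))   # shared IP is missed
print(top_ip(shards, shard_size=11))  # found only when every IP is returned
```

Locally every IP has a count of 1, so neither server has any reason to prefer 10.9.9.9, and it is found only when each server returns its complete list.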

Most systems are somewhere in between this best-case and worst-case scenario. The things of interest you're looking for occur more frequently than twice and you may have fewer unique terms or shards to consider.
However, for these worst-case scenarios the solution is to bring related data more closely together using entity-centric indexing. At the end of the day we are at the mercy of physics.
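As a rough illustration of what "entity-centric" means here, the sketch below folds login events into one summary record per IP address, so that ranking by unique users only has to look at one record per entity. This is my reading of the idea using plain Python dictionaries; the function name and record layout are hypothetical, and the events are the ones from the question above.

```python
from collections import defaultdict


def build_entity_index(events):
    """Fold (ip, user) login events into one summary record per IP."""
    users_by_ip = defaultdict(set)
    for ip, user in events:
        users_by_ip[ip].add(user)
    return {
        ip: {"users": sorted(users), "user_count": len(users)}
        for ip, users in users_by_ip.items()
    }


events = [
    ("192.168.1.1", "neo"),
    ("192.168.1.1", "trin"),
    ("192.168.1.1", "morf"),
    ("192.168.1.1", "doz"),
    ("192.168.1.2", "luke"),
    ("192.168.1.2", "anakin"),
    ("192.168.1.2", "yoda"),
]

entities = build_entity_index(events)
ranked = sorted(entities.items(), key=lambda kv: kv[1]["user_count"], reverse=True)
for ip, summary in ranked:
    print(ip, summary["user_count"])
```

Because each IP's users live in a single record, the count of 4 for 192.168.1.1 and 3 for 192.168.1.2 does not depend on how the raw events were spread across servers.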

Thanks for that. That clarifies the issue.

I guess I have a slight lack of faith that I can force real-world data closer together in the long term, but ES sounds interesting nevertheless for our POC, so I'll give it a shot first.
