Question about approximate counts

We're currently evaluating Elasticsearch, and trying to get a handle on the approximate counts that it uses. I know that in cardinality aggregations, the default value for `precision_threshold` is 3000. I think this means that if I get back a number that is 3000 or less, then it should be an exact answer.

My question is about other types of counts. Is the same thing true for those? Or are there any queries (e.g. top-10 queries) where it's not true -- could I get a value back of 40, when the actual answer is 41 or 42?

Computing exact counts is expensive because you need to keep a full copy of the original values in memory, e.g. all the URL strings when counting the number of unique pages visited.
The precision threshold is the count at which we change strategy. Below the threshold we count by keeping all the values in memory (actually, the hashes of the values, which can itself technically miscount if there is a hash collision). Above the threshold we switch to a more space-efficient structure, but it is less accurate.
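
As a concrete sketch, the threshold is just a parameter on the aggregation - here the `weblogs` index and `url.keyword` field are made-up names for illustration:

```
// "weblogs" and "url.keyword" are hypothetical names
GET weblogs/_search
{
  "size": 0,
  "aggs": {
    "unique_pages": {
      "cardinality": {
        "field": "url.keyword",
        "precision_threshold": 10000
      }
    }
  }
}
```

Raising `precision_threshold` (up to a maximum of 40000) buys accuracy at the cost of memory; counts below the configured threshold are expected to be close to exact.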

There is the potential for inaccuracy in the terms aggregation, but we specifically report the error margin when that is the case, and there are tweakable parameters to address the problem.
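
For example (with a hypothetical `events` index and `user.keyword` field), you can ask the terms aggregation to report its error bounds and give each shard a bigger candidate list:

```
// "events" and "user.keyword" are hypothetical names
GET events/_search
{
  "size": 0,
  "aggs": {
    "top_users": {
      "terms": {
        "field": "user.keyword",
        "size": 10,
        "shard_size": 500,
        "show_term_doc_count_error": true
      }
    }
  }
}
```

Each bucket in the response then carries a `doc_count_error_upper_bound` - a value of 0 means that bucket's count is exact - and raising `shard_size` makes each shard return more candidate terms, which shrinks the bound.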

Thanks Mark. Our product team is OK with approximate counts for large numbers, but they are understandably more concerned about approximate counts for low numbers. When doing a cardinality query that results in a single number, it's very reassuring to read about the precision setting. It's basically saying that any number below 3000 should be an exact count.

But if I have a query like "Top 20 users who did X", the counts in that table could be approximations even if they are much smaller than 3000, right? They could be off by around 2%, so if a number is displayed as 50, then the actual count could be 49 or 51?

In the more extreme cases, yes. Let's say the requirement was "top 10 people who got married most" - that's a rare event for most people, so if your database design was poor and you used a distributed index to store marriage certificates, each shard might have a count of only "1" for each person (people with multiple marriages may have each of their certificates held on different shards). When each shard is asked for its top 10 people (or even top 100, because of the shard_size multiplier), they are collectively likely to fail to identify the people married 3 times. Each shard will pick an arbitrary selection from the millions of people who have all been married only once to return to the coordinating node.

In these cases, indexing a summary document per person or using routing would be a way to avoid the top-N inaccuracies introduced by a distributed system. It's also possible to page through the data using the composite aggregation to get a full picture of each term. You may find running this wizard useful in choosing the right approach.
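
A minimal sketch of that paging, assuming a `marriages` index with a `person_id` keyword field:

```
// "marriages" and "person_id" are hypothetical names
GET marriages/_search
{
  "size": 0,
  "aggs": {
    "people": {
      "composite": {
        "size": 1000,
        "sources": [
          { "person": { "terms": { "field": "person_id" } } }
        ]
      }
    }
  }
}
```

Each response includes an `after_key`; passing it back as `after` in the next request walks through every person exactly once, so you get complete per-term counts rather than a top-N estimate.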

Thanks Mark. You suggested that you could use routing... specifically, you would route by person ID in your example, to ensure that each node can get an exact count of marriages for that person? And then the overall Top 10 would always be exact?

Yes - by ensuring all related data is held on the same shard, your query would do a complete analysis of each person and be 100% accurate.
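
As a sketch (hypothetical index and IDs again), supplying the same routing value for all of a person's certificates keeps them on one shard:

```
// route every certificate by the person it belongs to (hypothetical names)
PUT marriages/_doc/cert-1001?routing=person_42
{
  "person_id": "person_42",
  "date": "2011-06-18"
}
```

An aggregation still queries all shards, but each shard now holds every certificate for the people routed to it, so the shard-local counts - and therefore the merged top 10 - are exact. The trade-off is that you must supply the same routing value on every index, get, and update request for that person.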

Marriage data is an exception though, being an example of something that's incredibly flat - most people have only 1 record. However, most data is Zipf-ey in its distribution, and you can normally rely on the fact that the top 10% of entities have massively more documents than the bottom 90%. In this case each shard will home in on the same main culprits, because each node sees a healthy sample of these entities' prolific activity.
Using routing with Zipf-ey data can often be counter-productive because the imbalances between the haves and the have-nots are so large that you are likely to find some shards overwhelmed with a single prolific entity's data.
