Accuracy of Elasticsearch aggregation (sum) when the number of unique values is greater than a million

Hi, I am new to Elasticsearch.
Let's suppose I have Elasticsearch documents with 3 fields: date, user_id, money_spent.
I have around 10 million such documents, and the number of unique user_ids is greater than 1 million.

Now I am running an aggregation query, e.g. get the top 2000 users, sorted by sum(money_spent).
Will this query provide accurate results?
How will the size parameter affect the accuracy?

I am asking this because I read that document counts are approximate:
https://www.elastic.co/guide/en/elasticsearch/reference/5.1/search-aggregations-bucket-terms-aggregation.html#search-aggregations-bucket-terms-aggregation-approximate-counts

"aggs": {
  "groupByUserId": {
    "terms": {
      "field": "user_id",
      "size": 2000,
      "order": {
        "money_spent": "desc"
      }
    },
    "aggs": {
      "money_spent": {
        "sum": {
          "field": "money_spent"
        }
      }
    }
  }
}

Hi Prashant.
Yes, this can get complicated in a distributed system. You're on the right track, but watch out for non-zero values of doc_count_error_upper_bound in the results. If you see them, consider increasing the shard_size setting to trade RAM for accuracy.
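To make that advice concrete, here is a minimal Python sketch that builds the request from the question with an explicit shard_size and checks the reported error bound. The shard_size of 20000, the helper names and the sample response are assumptions for illustration, not values from a real cluster. As far as I know, the terms aggregation docs also note that the bound cannot be calculated when ordering by a sub-aggregation and is then reported as -1, so the check treats anything other than 0 as unconfirmed; please verify that against the docs for your version.

```python
import json

def top_spenders_request(size, shard_size):
    """Build a terms aggregation body like the one in the question, plus shard_size."""
    return {
        "size": 0,  # skip the hits, only the aggregation is needed
        "aggs": {
            "groupByUserId": {
                "terms": {
                    "field": "user_id",
                    "size": size,
                    "shard_size": shard_size,  # candidates each shard returns
                    "order": {"money_spent": "desc"},
                },
                "aggs": {"money_spent": {"sum": {"field": "money_spent"}}},
            }
        },
    }

def error_possible(response, agg_name="groupByUserId"):
    """Return True unless the reported error bound is exactly zero."""
    bound = response["aggregations"][agg_name].get("doc_count_error_upper_bound")
    return bound != 0

body = top_spenders_request(size=2000, shard_size=20000)
print(json.dumps(body["aggs"]["groupByUserId"]["terms"], indent=2))

# Hand-written response fragment for illustration only, not real output.
sample_response = {"aggregations": {"groupByUserId": {
    "doc_count_error_upper_bound": 37, "sum_other_doc_count": 0, "buckets": []}}}
print(error_possible(sample_response))
```

If the check reports a possible error, the usual next step is to raise shard_size and run the request again.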

I've got a wizard to help with picking the right strategy for various grouping questions.
https://plnkr.co/edit/iJSFP8eRrhC7l7Hx2XOL?p=preview
The path I took for your example was as follows:


Hi Mark, Thanks a lot for the reply.
Also the wizard you shared is helpful.

So just to confirm: with 1 million unique values, Elasticsearch aggregations like sum may not be accurate.
Let's say we want the top 1000 user_ids sorted by total money_spent in descending order:

a. A few user_ids that should have been in the result might be missed.
b. There can also be errors in the values of the top 1000 results, i.e. they may show a lower money_spent than they actually had.

Also, I was wondering about term partitioning: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#_filtering_values_with_partitions

If I have 20 partitions of 50,000 each, I will have to make 20 requests to Elasticsearch. Won't that make the overall response slower?

Correct on both assumptions. Possible but not inevitable. The doc_count_error_upper_bound will tell you if errors are possible.
If you route user data to the same shard then you eliminate a lot of the problem introduced by not keeping related data together in the same place when doing analysis.
Consider the scenario where you have many shards and no custom routing. Each shard may come up with a short list of big spenders based mostly on single large payments observed on that shard. The real biggest spender may only ever have made small payments, but these were randomly spread across a hundred shards.
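That scenario can be reproduced with a small pure-Python simulation. It is only a sketch of the per-shard "top candidates, then merge" idea, not Elasticsearch's actual implementation; the data, the user names and the function names are made up.

```python
from collections import defaultdict

def shard_top(docs, shard_size):
    """Sum money_spent per user on one shard and keep the shard's top candidates."""
    totals = defaultdict(int)
    for user_id, money in docs:
        totals[user_id] += money
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:shard_size]

def merged_top(shards, size, shard_size):
    """Merge the per-shard candidate lists, as a coordinating node would."""
    merged = defaultdict(int)
    for docs in shards:
        for user_id, total in shard_top(docs, shard_size):
            merged[user_id] += total
    ranked = sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:size]

# Made-up data: 4 shards. "steady" spends 40 on every shard (160 in total),
# while each "whale" makes a single payment of 100 on one shard.
shards = [[("steady", 40), (f"whale{i}", 100)] for i in range(4)]

print(merged_top(shards, size=1, shard_size=1))  # [('whale0', 100)]
print(merged_top(shards, size=1, shard_size=2))  # [('steady', 160)]
```

With shard_size equal to size, every shard reports only its local whale and the real biggest spender never reaches the merge step. Once shard_size is large enough for "steady" to appear in each shard's list, the merged total is complete.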

Yes, but more accurate. It's done for accuracy not speed. Of course if you use custom routing based on user-ids then you help solve the data-locality problem that makes accuracy an issue in the first place. Entity-centric indexing is another level of data locality (same doc rather than same shard) and also brings other analytic possibilities e.g. behavioural analysis by maintaining user properties on update e.g. risk ratings.


Hi Mark, this is really helpful. Thanks a lot. Just one more question:
if I make the "size" parameter in "aggs" greater than the total possible number of unique values, can I be sure that the results will be 100% accurate?

E.g. suppose I know that at most 300,000 unique values are possible and I set size = 1,000,000.

Will the results be 100% accurate then?

Aggs are a "one-pass" operation meaning whatever data you need to reason about in producing the final result has to be able to fit into RAM. This puts limits on what you can fit into responses from each shard. That's why we put limits on the number of "buckets" that can be produced in any response. RAM becomes the bottleneck. Large result sizes have limits.
If all the data for a user is located on a single shard or document we don't need to transport all the pieces of the puzzle over the network back to a central point to determine that person's net worth. Routing data in this way avoids some of the issues. Each shard only holds a fraction of all the users but conveniently holds ALL the data for them. In this scenario, shard_size does not have to be greater than the final size parameter because we can be certain that each shard has fully considered all the data for each user.
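A rough simulation of the routed case, again in plain Python. The crc32-based route function is only a stand-in for routing on user_id (it is not the hash Elasticsearch uses), and the shard count and payments are invented for the example.

```python
import zlib
from collections import defaultdict

NUM_SHARDS = 4  # assumed shard count for the illustration

def route(user_id):
    """Stand-in for routing on user_id: a user always maps to the same shard."""
    return zlib.crc32(user_id.encode()) % NUM_SHARDS

def top_users(docs, limit):
    """Sum money_spent per user and return the top `limit` (user, total) pairs."""
    totals = defaultdict(int)
    for user_id, money in docs:
        totals[user_id] += money
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]

# Made-up payments: (user_id, money_spent), five payments per user.
payments = [(f"user{u}", (u * 7 + n) % 50 + 1) for u in range(200) for n in range(5)]

# Place every payment on the shard chosen by its user_id.
shards = [[] for _ in range(NUM_SHARDS)]
for user_id, money in payments:
    shards[route(user_id)].append((user_id, money))

size = 10
# shard_size equals size here: each shard returns only its own top 10.
candidates = [bucket for docs in shards for bucket in top_users(docs, size)]
merged = sorted(candidates, key=lambda kv: (-kv[1], kv[0]))[:size]

print(merged == top_users(payments, size))
```

Because each shard's totals are already complete, any user in the global top 10 must also be in the top 10 of their own shard, so nothing is lost by returning only `size` candidates per shard.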
However, in a randomised distribution (the default Elasticsearch configuration, so no custom routing), shard_size needs to be greater than the final size. This is required to pull back incomplete results for more users and glue them together to complete the picture. These inflated shard_size settings may hit RAM limits, and results could become inaccurate. But by using the partitioning feature of the terms agg you limit any analysis to a subset of all the unique IDs. Your client code has to make multiple requests, but all the shards are returning manageable subsets of the overall data.
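The client-side loop for partitioning could look roughly like the sketch below. The include/partition/num_partitions body follows the terms aggregation partitioning feature linked earlier in the thread; run_search, fake_search and the in-memory data are hypothetical stand-ins so the example runs without a cluster, and a real request would also need a shard_size that yields zero error bounds.

```python
import heapq
import zlib
from collections import defaultdict

def partition_request(partition, num_partitions, size):
    """Terms aggregation body restricted to one partition of the user_id values."""
    return {
        "size": 0,
        "aggs": {
            "groupByUserId": {
                "terms": {
                    "field": "user_id",
                    "include": {"partition": partition, "num_partitions": num_partitions},
                    "size": size,
                    "order": {"money_spent": "desc"},
                },
                "aggs": {"money_spent": {"sum": {"field": "money_spent"}}},
            }
        },
    }

def top_across_partitions(run_search, num_partitions, size):
    """Issue one request per partition and keep a running global top list."""
    best = []
    for p in range(num_partitions):
        response = run_search(partition_request(p, num_partitions, size))
        for b in response["aggregations"]["groupByUserId"]["buckets"]:
            best.append((b["money_spent"]["value"], b["key"]))
        best = heapq.nlargest(size, best)
    return [(key, value) for value, key in best]

# Hypothetical in-memory data: user u makes 3 payments totalling 3*u + 3.
PAYMENTS = [(f"user{u}", u + n) for u in range(1000) for n in range(3)]

def fake_search(body):
    """Stand-in for a real search call; partitions users with crc32 (an assumption)."""
    terms = body["aggs"]["groupByUserId"]["terms"]
    inc = terms["include"]
    totals = defaultdict(int)
    for user_id, money in PAYMENTS:
        if zlib.crc32(user_id.encode()) % inc["num_partitions"] == inc["partition"]:
            totals[user_id] += money
    ranked = sorted(totals.items(), key=lambda kv: -kv[1])[: terms["size"]]
    return {"aggregations": {"groupByUserId": {"buckets": [
        {"key": k, "money_spent": {"value": v}} for k, v in ranked]}}}

top = top_across_partitions(fake_search, num_partitions=20, size=5)
print(top)
```

Each request only ever handles roughly 1/20th of the user IDs, which is the point of the feature: more round trips, but smaller and more manageable per-shard responses.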


Hi Mark, so the only sure way to get 100% accurate aggregation results is to ensure that the data required to prepare the response always fits in RAM.

But it's hard to do this in a distributed environment where multiple queries can arrive at the same time.

We can achieve better accuracy through routing and term partitions, but even then we cannot guarantee that results will always be 100% accurate.

Am I correct? I ask because in my case accuracy is critical, and I may need to change the way I store data.

Thanks.

No, you can guarantee accuracy. Routing ensures you fully consider all user IDs locally, so there are no inaccuracies in the numbers coming out of each shard.
Term partitioning requires you to pick a number of partitions and a suitable shard_size that together report zero error bounds in the results.


Thanks a lot Mark. This is really helpful.
I will start with Routing and Term partitioning and see how it goes.
I will come back to you in case I have any more questions.

Thanks again. :+1:


Bear in mind that routing is a strategy that only solves data locality in a single-index system.

In a system that uses multiple time-based indices (a common pattern to help with data ageing) you still have the problem of each user's data being distributed across multiple shards, just shards in different indices. Routing should help with the RAM/accuracy trade-offs because more of the data joining can be achieved locally on each shard, but you will still need to set shard_size > size in order to join up user documents held in different indices. If the shard_size setting required to guarantee accuracy exceeds RAM then you may need to resort to term partitioning.

