Sum Aggregation returning bad value

Hi,
I am using a simple sum aggregation query. It works fine when I have a small number of documents, but when the query needs to aggregate over 10 million filtered documents, the sum value is not correct. I found that the solution is to increase the shard_size parameter, but this is only a temporary fix. Maybe there is a better way to do a SUM aggregation?
Elasticsearch version 5.5
My query:

    {
      "query": {
        "bool": {
          "must": [
            {
              "terms": {
                "round_id": ["1","2","3"]
              }
            }
          ],
          "must_not": [
            {
              "term": {
                "user_id": 0
              }
            }
          ]
        }
      },
      "size": 0,
      "aggs": {
        "groups": {
          "terms": {
            "field": "user_id",
            "size": 10,
            "order": {"total_points_sum": "desc"}
          },
          "aggs": {
            "total_points_sum": {
              "sum": {
                "field": "points"
              }
            }
          }
        }
      }
    }

This query matches more than 10 million documents. I need only the 10 aggregated results with the biggest sum values.
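For reference, the temporary workaround is just adding shard_size to the terms aggregation; the rest of the request stays the same, and the 1000 below is only an example value:

    "aggs": {
      "groups": {
        "terms": {
          "field": "user_id",
          "size": 10,
          "shard_size": 1000,
          "order": {"total_points_sum": "desc"}
        },
        "aggs": {
          "total_points_sum": {
            "sum": {"field": "points"}
          }
        }
      }
    }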
I would be very grateful if someone could help me.

Hi Nerijus,

Why do you say increasing shard_size is only a temporary solution?
We can talk about alternative strategies like term partitioning or entity-centric indexes, but first it would be good to know whether it's important that the top 10 users are the true top 10 (all things considered), or that the sums shown for the (maybe slightly inaccurate) top 10 are accurate numbers for those users. If the latter, then you just need to run a follow-up request where your terms agg lists the user IDs in the include clause.
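Roughly, that follow-up could look like the sketch below (the user IDs are placeholders, and I'm assuming your user_id values can be listed as exact values in the include clause). You would keep your original query section so the sums are computed over the same filtered documents:

    {
      "size": 0,
      "aggs": {
        "groups": {
          "terms": {
            "field": "user_id",
            "include": [1, 17, 42],
            "size": 10
          },
          "aggs": {
            "total_points_sum": {
              "sum": {"field": "points"}
            }
          }
        }
      }
    }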

I think shard_size is a temporary solution because if I assign a value of, let's say, 10000, then after some time, when I have more documents, I will need to increase shard_size again. I am not sure if it's OK to set a very big shard_size value. How much can we increase it? At the moment we see that the increased shard_size helps us get exact rather than approximate data.
The 10 users I get are not the same ones I should get with the correct sum values, so I cannot run a follow-up with those 10 user IDs.
Maybe it's possible to write the query in another way?

Given you're asking for the biggest users (as opposed to the smallest sums), the shard_size required should be manageable, meaning it should fit into the RAM used for a single request.
The way to think about this is that querying many shards for the top N of something is like asking a group of people for their favourite album in order to find their shared favourite. This question is effectively shard_size=1 and could lead to an inaccurate result (every person in the group returns a personal favourite that is not shared by any other member of the group). If you asked each person for their top 10 albums (shard_size=10) then you might discover the group's favourite album was Nirvana's Nevermind (it being ranked 5th, 7th and 9th by 3 different people).
In theory, each person might have such differing tastes that you'd have to ask them for their top million albums before you'd discover anything they have in common. That's theoretically possible but in practice highly unlikely. The same is true if you randomly distribute data across shards.

We report back on the bounds of the inaccuracies in results so you can adjust shard_size accordingly. If you do hit memory errors with large shard_sizes then you can either:
A) Break your query into multiple requests using the partitioning feature for terms aggs, or
B) Bring related data closer to each other at index time using routing (ensures same shard) or entity-centric indexing (same document); a routing sketch follows below.
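For option B's routing part, a minimal sketch (the index and type names are made up): route each document by its user_id at index time so that all of a given user's documents land on the same shard.

    PUT points/doc/123?routing=42
    {
      "round_id": "1",
      "user_id": 42,
      "points": 10
    }

The search still fans out to every shard, but each shard then holds all of the documents for the users it sees, so its local sums (and its local top 10) are exact.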

Thanks for your answers. Just to be sure I understood correctly:
A) Is the partitioning feature basically the same idea as scroll, but for aggregations? So we need to run a lot of queries and just increase the partition number each time? As I understand it, we need to count the unique user IDs with a cardinality aggregation to choose num_partitions. But cardinality is also an approximate number, so how do we get an exact number to know the partition count? https://www.elastic.co/guide/en/elasticsearch/reference/master/search-aggregations-bucket-terms-aggregation.html#_filtering_values_with_partitions
B) It's hard to find more info about entity-centric indexing, and I did not get how it works. Maybe you can provide a link to documentation about this?

It ensures that all shards look at the same arbitrary subset of terms. For each term we compute its hashcode modulo N, and if the answer comes out as the chosen partition number then independent shards can agree that this term is one of the terms being considered in this pass over the data. By reducing the number of unique terms considered in any one request, the reported error margins decrease towards zero, because we are able to pack fully complete details for the considered terms into the space-limited shard responses.
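A minimal sketch of one pass, with a made-up num_partitions of 20:

    {
      "size": 0,
      "aggs": {
        "groups": {
          "terms": {
            "field": "user_id",
            "include": {
              "partition": 0,
              "num_partitions": 20
            },
            "size": 10,
            "order": {"total_points_sum": "desc"}
          },
          "aggs": {
            "total_points_sum": {
              "sum": {"field": "points"}
            }
          }
        }
      }
    }

You run the same request once for each partition value (0 through 19 here) and merge the per-partition top 10s on the client side. On the cardinality question: the estimate only needs to be roughly right, since num_partitions just controls how finely the work is split, not which users get counted.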

For entity-centric indexing, see the talk and example scripts/data here: https://twitter.com/elasticmark/status/1009380268409610240?s=21

Thanks for your reply.
So, to summarise: as I understand it, the best way to get accurate rather than approximate results is to increase "shard_size".
And two last things:

  1. From your experience, what is the best way to calculate the "shard_size" value? For example, for an instance with 32 GiB of memory.
  2. Right now we have about 25 million records in one index and use a "SUM" query like the one in the first post. If we split it into smaller (daily) indices, will that have a big impact on the accuracy of the results?

Thanks.

The lowest number that produces zero reported error margin in the results - plus a factor to allow for future growth.
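For reference, that reported error margin shows up at the top of the terms aggregation in the response; the numbers here are only illustrative and the buckets are omitted:

    "aggregations": {
      "groups": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 9875432,
        "buckets": [ ... ]
      }
    }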

Creating more division between related content will only exacerbate the accuracy concerns. It is precisely because you don’t keep all data in the same place that we have accuracy issues. We use distributed indices to deal with scale or ease deletion of old data, not improve accuracy.

Thank you for your patience and answers.


You’re very welcome
