Elasticsearch terms aggregation with partition does not honor the “size” value

I have a nested field in the ES index. I used the cardinality aggregation to get the total number of buckets and then came up with num_partitions and size values to fetch the buckets.

E.g. the total number of term buckets = 761

Used num_partitions = 4, size = 200

But this returned the following numbers of buckets across the 4 requests: 200, 194, 176, 165, which sum to 735 < 761.

Now when using num_partitions = 3, size = 300

The 3 requests returned the following numbers of buckets: 246, 261, 254, which sum to 761.

The second case did not miss any buckets, but I would still expect the counts to be 300, 300, 161.

So, the questions are -

  • Why did the first choice of num_partitions and size miss 26 buckets?
  • Why is it not honouring the "size" value in either of the above scenarios?

NOTE -

  • ES version = 5.6.1. Upgrading to ES 6 is not possible in the near future, hence I cannot consider using the composite aggregation.
  • I have referred to this old question, and one of the answers quotes the ES documentation: "The terms aggregation is meant to return the top terms and does not allow pagination." But I still do not understand why it does not return buckets in the numbers 300, 300, 161 in case 2 above. Without partitions, it has always honoured the "size" value.

In case required, here is the aggregation part of the query -

   "aggs": {
       "nested": {
           "path": "software"
       },
       "aggregations": {
           "filtered": {
               "filter": {
                   "bool": {
                       "must": [
                           {
                               "match_phrase_prefix": {
                                   "sw.publisher": {
                                       "query": "O",
                                       "slop": 100,
                                       "max_expansions": 50,
                                       "boost": 1.0
                                   }
                               }
                           }
                       ],
                       "adjust_pure_negative": true,
                       "boost": 1.0
                   }
               },
               "aggregations": {
                   "host_sw": {
                       "terms": {
                           "field": "sw.hostId",
                           "size": 200,
                           "min_doc_count": 1,
                           "shard_min_doc_count": 0,
                           "show_term_doc_count_error": false,
                           "order": [
                               {
                                   "_count": "desc"
                               },
                               {
                                   "_term": "asc"
                               }
                           ],
                           "include": {
                               "partition": 3,
                               "num_partitions": 4
                           }
                       }
                   },
                   "sw_host_id.total_count": {
                       "cardinality": {
                           "field": "sw.hostId",
                           "precision_threshold": 20000
                       }
                   }
               }
           }
       }
   }
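
The four requests differ only in the "partition" value; the client-side loop looks roughly like this (elasticsearch-py, with a made-up index name and a made-up helper that just swaps the partition number into the aggregation above):

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")  # placeholder address

    NUM_PARTITIONS = 4
    all_buckets = []
    for p in range(NUM_PARTITIONS):
        # build_search_body is a stand-in: it returns the request body
        # with the aggregation shown above and "include.partition" = p.
        body = build_search_body(partition=p)
        resp = es.search(index="hosts", body=body)  # made-up index name
        buckets = resp["aggregations"]["nested_sw"]["filtered"]["host_sw"]["buckets"]
        all_buckets.extend(buckets)

    # With num_partitions = 4 and size = 200 this collected only 735 of
    # the 761 expected buckets.
    print(len(all_buckets))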

Because you asked for a max of 200 results and there were 226 terms in one of the partitions (761 - 194 - 176 - 165 = 226), not all of them were returned.

The size is the maximum number of results per request, not an exact count.

What may help explain the behaviour is to understand that partitioning is a rough way to split terms into groups. It is based on hashing each term and taking it modulo N, so there can be a small degree of unevenness in partition sizes.
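
A rough illustration of that hash-then-modulo split in plain Python (this is not Elasticsearch's actual hash function, just the same idea):

    import hashlib

    def partition(term: str, num_partitions: int) -> int:
        # Stable hash of the term, then modulo N - the scheme the terms
        # aggregation's "include" partitioning is based on (illustrative
        # only; ES uses its own internal hash).
        h = int(hashlib.md5(term.encode()).hexdigest(), 16)
        return h % num_partitions

    # Simulate 761 unique terms split into 4 partitions.
    counts = [0, 0, 0, 0]
    for i in range(761):
        counts[partition(f"host-{i}", 4)] += 1

    print(counts)       # something like [196, 210, 178, 177] - uneven
    print(sum(counts))  # 761 - every term lands in exactly one partition

Every term lands in exactly one partition, so partitioning itself loses nothing - but any single partition can hold more than cardinality / num_partitions terms, which is why one of the four partitions above contained 226.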


Thank you @Mark_Harwood, can you please elaborate on the following?
What is N here (am I being naive?), and how does the hashing part work?

Since there can be unevenness, I suppose it cannot be used for pagination?
What I am looking for is a safe formula, if I can come up with one, so that no buckets are missed across the pages.
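Something along these lines, perhaps - padding the per-partition average with a safety factor to absorb both the hash unevenness and the fact that the cardinality aggregation is approximate (the 1.5 headroom is just my guess):

    import math

    def safe_partition_size(cardinality: int, num_partitions: int,
                            headroom: float = 1.5) -> int:
        # headroom is a guessed safety factor covering the hash-based
        # unevenness between partitions and the approximate cardinality.
        return math.ceil(cardinality / num_partitions * headroom)

    # For the numbers in this thread: 761 terms, 4 partitions.
    print(safe_partition_size(761, 4))  # 286, comfortably above the 226
                                        # terms in the largest partition

With size = 286 instead of 200, the first experiment would not have dropped any buckets; the second one worked only because 300 happened to exceed the largest partition's 261 terms.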

N is the number of partitions you want to break a problem into.
We use the same algorithm to route a document with an ID to a particular choice of server.
It's a common technique and this doc discusses it.

The challenge with running aggregations on a distributed data store is that you might have many unique values and want to limit the chatter required between data nodes. Take, for example, a store of tweets with many data nodes - each holding a month's worth of tweets.
Let's say you wanted to query across them all and look at all the Twitter user handles to see who had tweeted about "covid19" and how many times they had done so. That is likely to be a lot of unique user accounts, so to page through them all you'd need to break the task into multiple requests. If you didn't care about the sort order, you could simply page through them alphabetically using the composite aggregation with the after parameter (rather than the terms aggregation). Each of the shards can independently agree that a comes before b in the alphabet, so they can agree without coordination on which accounts are the next set to return for consideration.
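
A sketch of that composite-with-after paging loop (using the elasticsearch-py client; composite needs ES 6.1+, so it would not run on the 5.6 cluster in this thread, and the index/field names are made-up placeholders):

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")  # placeholder address

    after_key = None
    while True:
        composite = {
            "size": 1000,
            "sources": [{"handle": {"terms": {"field": "user.keyword"}}}],
        }
        if after_key:
            composite["after"] = after_key  # resume where the last page ended

        resp = es.search(
            index="tweets",  # made-up index name
            body={
                "size": 0,
                "query": {"match": {"text": "covid19"}},
                "aggs": {"handles": {"composite": composite}},
            },
        )
        agg = resp["aggregations"]["handles"]
        for bucket in agg["buckets"]:
            print(bucket["key"]["handle"], bucket["doc_count"])

        after_key = agg.get("after_key")  # absent once the last page is read
        if not after_key:
            break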
However, if you wanted to look at a list of the first people to tweet about covid19, the task is much harder, because you'd want to sort the account IDs by ascending min date. In a distributed store this is very hard without streaming a lot of data between nodes - unlike an alphabetic sort order, the data nodes have no idea whether @DrFauci or @JoeShmoe should be the next candidate to be returned; they don't know for sure who tweeted first about covid19. What we can do to make the computation simpler is have each request focus on a small subset (aka "partition") of all the millions of unique terms and sort just those by first-tweet date. If the partitions and the size of results returned are sufficiently small, the sorted results from each shard for a request can be fused in a way that should be accurate.
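
One of those partitioned, min-date-sorted requests might look like this (same made-up field names; the client would issue N of these, varying only the partition number, and merge the small sorted result lists afterwards):

    # Request for partition 0 of 20: sort just this subset of user
    # handles by the earliest tweet date.
    request_body = {
        "size": 0,
        "query": {"match": {"text": "covid19"}},
        "aggs": {
            "first_tweeters": {
                "terms": {
                    "field": "user.keyword",  # made-up field name
                    "include": {"partition": 0, "num_partitions": 20},
                    "size": 100,
                    "order": {"first_tweet": "asc"},  # by the sub-agg below
                },
                "aggs": {
                    "first_tweet": {"min": {"field": "created_at"}}
                },
            }
        },
    }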
This kind of distributed analytics is made complex in the same way the old fox, chicken and grain transportation problem is made hard by the constraint of a small boat.

I appreciate there are a lot of complex choices here, so I developed a wizard you can run to walk through the options and make a choice.


Thanks for the detailed response.
As I mentioned in the question, I cannot consider the composite aggregation option yet, because we are using ES 5.6 and an upgrade is not possible in the near future.

By the way, that 'wizard' is a cool tool! But it also suggests the composite aggregation for my use case.
