During terms aggregation with partition we get sum_other_doc_count > 0 in between partitions

akleiber · December 13, 2022, 2:40pm

We are running ES 6.8 and move some documents to another ES Cluster with a more recent version. We also transform the documents during that process.

We need to group documents by a field, so we do this via a terms aggregation.
Because there are often more than the search.max_buckets settings (we can not set it higher) we use the partition option.
I understand that this is no pagination but for what we want to achieve it is fine that the partitions have different number of results.

In order to determine how many partitions we need we first run a cardinality aggregation on that field, like suggested in the docs.
We will then fetch our data with a size of 100, i.e. max 100 buckets for the aggregation per search.

What we see from the results is that sum_other_doc_count is sometimes greater than 0. And we wonder what that means in context of using partitioning.

Our queries

Determine number of partitions

GET the-index/_search
{
  "aggs": {
    "count": {
      "cardinality": {
        "field": "groupField"
      }
    }
  }, 
  "query": {
    "bool": {
      "filter": {
        "bool": {
          "must": [
            {
              "term": {
                "someField": "someValue"
              }
            }
          ]
        }
      }
    }
  },
  "size": 0
}

We now get a number back and divide that bei 100 (our aggregation size setting) so we roughly know how many partitions we need.
Let's say cardinality is 157300 we would set num_partitions to 1573.

Fetch buckets

GET the-index/_search
{
  "aggs": {
    "byGroupField": {
      "terms": {
        "execution_hint": "map",
        "field": "groupField",
        "size": 100,
        "include": {
          "partition": 123,
          "num_partitions": 1573
        }
      }
    }
  },
  "query": {
    "bool": {
      "filter": {
        "bool": {
          "must": [
            {
              "term": {
                "someField": "someValue"
              }
            }
          ]
        }
      }
    }
  },
  "size": 0
}

What we observed now is sum_other_doc_count being greater than 0 for some partitions. Even when we set num_partitions way higher than necessary (in respect to the size setting) we will get sum_other_doc_count greater 0 for many partitions.

What we expected was that sum_other_doc_count is greater 0 only for the last partition in case our size for the aggregation is too low.

Can somebody point me to some docs or post an explanation why we get sum_other_doc_count greater 0 for many partitions?

Thank you very much

Mark_Harwood1 · December 13, 2022, 3:04pm

What were you picking for num_partitions?

The routing for which terms go in which partitions is based on hash-modulo eg. term.hashcode()%num_partitions
Being based on a hashing algorithm we can expect some unevenness in numbers of terms in each partition (e.g. it might be 95, 99, 96, 100, 94, 102......)
That's why when you set size to 100 the last partition in the example above would have 2 missing terms.

akleiber · December 13, 2022, 3:09pm

1773

Ah thank you for that explanation. And those terms will also not be included in the next partition right? So we need to choose a "better" size in respect to num_partitions ?
I am still a bit confused how to avoid tose missing? terms.

Mark_Harwood1 · December 13, 2022, 3:29pm

Correct. Hash-modulo routing is a simple and efficient way to organise potentially billions of values into (roughly) equal sized groups. Given the groups can vary a little above and below your target size just set the retrieval 'size' setting to something >100 to allow for this overspill. Target-partition-size x 2 (e.g. 100 x 2 = 200) should be more than enough I'd have thought to compensate for hashing variations in partition size.

I notice your example is only sorting the terms by their value and not by anything more complex like a derived sum of sales. In the simpler cases the composite aggregation would be a better way to page through results

system · January 10, 2023, 3:29pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Elasticsearch terms aggregation with partition does not retuning equal bucket Elasticsearch	6	1111	April 11, 2022
Elasticsearch terms aggregation with partition does not honor the “size” value Elasticsearch	5	1817	May 25, 2021
Aggregations return result with sum_other_doc_count Elasticsearch	3	11410	July 5, 2017
Terms aggregation bucket Sorting is not working when using with partition Elasticsearch	13	1481	April 22, 2022
Need help with Terms Aggregation : buckets count Elasticsearch	1	325	May 12, 2020

During terms aggregation with partition we get sum_other_doc_count > 0 in between partitions

Related topics