During terms aggregation with partition we get sum_other_doc_count > 0 in between partitions

We are running ES 6.8 and move some documents to another ES Cluster with a more recent version. We also transform the documents during that process.

We need to group documents by a field, so we do this via a terms aggregation.
Because there are often more than the search.max_buckets settings (we can not set it higher) we use the partition option.
I understand that this is no pagination but for what we want to achieve it is fine that the partitions have different number of results.

In order to determine how many partitions we need we first run a cardinality aggregation on that field, like suggested in the docs.
We will then fetch our data with a size of 100, i.e. max 100 buckets for the aggregation per search.

What we see from the results is that sum_other_doc_count is sometimes greater than 0. And we wonder what that means in context of using partitioning.

Our queries

Determine number of partitions

GET the-index/_search
{
  "aggs": {
    "count": {
      "cardinality": {
        "field": "groupField"
      }
    }
  }, 
  "query": {
    "bool": {
      "filter": {
        "bool": {
          "must": [
            {
              "term": {
                "someField": "someValue"
              }
            }
          ]
        }
      }
    }
  },
  "size": 0
}

We now get a number back and divide that bei 100 (our aggregation size setting) so we roughly know how many partitions we need.
Let's say cardinality is 157300 we would set num_partitions to 1573.

Fetch buckets

GET the-index/_search
{
  "aggs": {
    "byGroupField": {
      "terms": {
        "execution_hint": "map",
        "field": "groupField",
        "size": 100,
        "include": {
          "partition": 123,
          "num_partitions": 1573
        }
      }
    }
  },
  "query": {
    "bool": {
      "filter": {
        "bool": {
          "must": [
            {
              "term": {
                "someField": "someValue"
              }
            }
          ]
        }
      }
    }
  },
  "size": 0
}

What we observed now is sum_other_doc_count being greater than 0 for some partitions. Even when we set num_partitions way higher than necessary (in respect to the size setting) we will get sum_other_doc_count greater 0 for many partitions.

What we expected was that sum_other_doc_count is greater 0 only for the last partition in case our size for the aggregation is too low.

Can somebody point me to some docs or post an explanation why we get sum_other_doc_count greater 0 for many partitions?

Thank you very much :slight_smile:

What were you picking for num_partitions?

The routing for which terms go in which partitions is based on hash-modulo eg. term.hashcode()%num_partitions
Being based on a hashing algorithm we can expect some unevenness in numbers of terms in each partition (e.g. it might be 95, 99, 96, 100, 94, 102......)
That's why when you set size to 100 the last partition in the example above would have 2 missing terms.

1773

Ah thank you for that explanation. And those terms will also not be included in the next partition right? So we need to choose a "better" size in respect to num_partitions ?
I am still a bit confused how to avoid tose missing? terms.

Correct. Hash-modulo routing is a simple and efficient way to organise potentially billions of values into (roughly) equal sized groups. Given the groups can vary a little above and below your target size just set the retrieval 'size' setting to something >100 to allow for this overspill. Target-partition-size x 2 (e.g. 100 x 2 = 200) should be more than enough I'd have thought to compensate for hashing variations in partition size.

I notice your example is only sorting the terms by their value and not by anything more complex like a derived sum of sales. In the simpler cases the composite aggregation would be a better way to page through results

1 Like

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.