Aggregation Group By With Total Count

HI,

I am stuck with a use case where I need to group N number of persons by company and get the total count of grouped persons across the pages. So far the query I have prepared is working fine up to the level where the grouping is performed but I need two things:

  1. List 50 records per page and use the pagination for more records
  2. Get the total count of persons grouped by company across the pages to set the total number of records for pagination on the front end.
{
  "size": 0,
  "aggs": {
    "group_by_company": {
      "composite": {
        "size": 50,
        "sources": [
          {
            "company_name": {
              "terms": {
                "field": "company_name.keyword"
              }
            }
          }
        ]
      },
      "aggs": {
        "persons": {
          "top_hits": {
            "size": 1,
            "_source": ["person_id",  "company_id", "company_name"]
          }
        },
        "persons_count": {
          "terms": {
            "field": "person_id",
            "size": 1
          }
        }
      }
    },
    "total_persons_count": {
      "sum_bucket": {
        "buckets_path": "group_by_company>persons_count"
      }
    }
  }
}

This is the response of a single bucket from the above query:

{
          "key": "",
          "doc_count": 698478,
          "persons": {
            "hits": {
              "total": {
                "value": 698478,
                "relation": "eq"
              },
              "max_score": 1,
              "hits": [
                {
                  "_index": "persons",
                  "_id": "5392465",
                  "_score": 1,
                  "_source": {
                    "person_id": 5392465,
                    "company_name": "",
                    "company_id": "0"
                  }
                }
              ]
            }
          },
          "persons_count": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 698477,
            "buckets": [
              {
                "key": 3,
                "doc_count": 1
              }
            ]
          }
        }

So in this bucket, we have the persons_count which counts the top hits, now what I want is, to sum up the count of hits array or the persons_count bucket across all the pages. I am trying to sum up the persons_count bucket but it gives an error due to the composite. Please see the following error:

{
  "error": {
    "root_cause": [
      {
        "type": "action_request_validation_exception",
        "reason": "Validation Failed: 1: Unable to find unqualified multi-bucket aggregation in buckets_path. Path must include a multi-bucket aggregation for aggregation [total_persons_count] found :org.elasticsearch.search.aggregations.bucket.composite.CompositeAggregationBuilder for buckets path: group_by_company>persons_count;"
      }
    ],
    "type": "action_request_validation_exception",
    "reason": "Validation Failed: 1: Unable to find unqualified multi-bucket aggregation in buckets_path. Path must include a multi-bucket aggregation for aggregation [total_persons_count] found :org.elasticsearch.search.aggregations.bucket.composite.CompositeAggregationBuilder for buckets path: group_by_company>persons_count;"
  },
  "status": 400
}

Any help would be appreciated. Please let me know if you need any other information.

I believe the terms agg under your persons_count is actually returning a single person_id value and that the 3 in your sample is actually a person_id and not a count of persons.

Your sum multi bucket agg then fails because of this.

You could use a cardinality agg with high precision but I think that Elasticsearch provides a "special" key for getting the size of the bucket and that's _count.

As a result you could skip the persons_count altogether and try changing your sum agg to just

"total_persons_count": {
      "sum_bucket": {
        "buckets_path": "group_by_company>persons>_count"
      }
    }

You could do something similar for calculating the per bucket size for persons_count if you need the per bucket size in your client app.

I haven't tested this directly but if I have time later today may give it a shot.

Let me know how it goes

Persons_count is counting the number of items from the hits array in persons, and the 3 is just a random key not the actual person_id.

I tried the _count on persons but it still throws the error i.e. path not supported:

{
  "error": {
    "root_cause": [],
    "type": "search_phase_execution_exception",
    "reason": "",
    "phase": "rank-feature",
    "grouped": true,
    "failed_shards": [],
    "caused_by": {
      "type": "illegal_argument_exception",
      "reason": "path not supported for [persons]: [_count]"
    }
  },
  "status": 400
}

From your original query you don't have a persons bucket at the root, it's under group by company

So I think you'd want to use group_by_company>persons>_count

If you can share a small set of sample documents and the exact query you're running, I can give you an exact agg

So this is my original query, and I am trying to sum up the buckets from persons agg:

{
  "size": 0,
  "aggs": {
    "group_by_company": {
      "composite": {
        "size": 50,
        "sources": [
          {
            "company_name": {
              "terms": {
                "field": "company_name.keyword"
              }
            }
          }
        ]
      },
      "aggs": {
        "persons": {
          "top_hits": {
            "size": 1,
            "_source": ["person_id",  "company_id", "company_name"]
          }
        },
        "persons_count": {
          "terms": {
            "field": "person_id",
            "size": 1
          }
        }
      }
    },
    "total_persons_count": {
      "sum_bucket": {
        "buckets_path": "group_by_company>persons>_count"
      }
    }
  }
}

The above query gives the following error, and the error is same even if I use persons_count instead of persons in the buckets path:

{
  "error": {
    "root_cause": [
      {
        "type": "action_request_validation_exception",
        "reason": "Validation Failed: 1: Unable to find unqualified multi-bucket aggregation in buckets_path. Path must include a multi-bucket aggregation for aggregation [total_persons_count] found :org.elasticsearch.search.aggregations.bucket.composite.CompositeAggregationBuilder for buckets path: group_by_company>persons>_count;"
      }
    ],
    "type": "action_request_validation_exception",
    "reason": "Validation Failed: 1: Unable to find unqualified multi-bucket aggregation in buckets_path. Path must include a multi-bucket aggregation for aggregation [total_persons_count] found :org.elasticsearch.search.aggregations.bucket.composite.CompositeAggregationBuilder for buckets path: group_by_company>persons>_count;"
  },
  "status": 400
}

And if I run the following query:

{
  "size": 0,
  "aggs": {
    "group_by_company": {
      "terms": {
        "size": 50,
        "field": "company_name.keyword"
      },
      "aggs": {
        "persons": {
          "top_hits": {
            "size": 1,
            "_source": ["person_id", "person_firstname", "person_token", "company_id", "company_name"]
          }
        },
        "persons_count": {
          "terms": {
            "field": "person_id",
            "size": 1
          }
        }
      }
    },
    "total_persons_count": {
      "sum_bucket": {
        "buckets_path": "group_by_company>persons_count._bucket_count"
      }
    }
  }
}

This sums up the buckets from the persons_count but only from the first page, what I want is to sum up the bucket count from persons_count across the pages. Here is the response from the above query:

{
  "took": 685,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 10000,
      "relation": "gte"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "group_by_company": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 4270224,
      "buckets": [
        {
          "key": "",
          "doc_count": 698478,
          "persons": {
            "hits": {
              "total": {
                "value": 698478,
                "relation": "eq"
              },
              "max_score": 1,
              "hits": [
                {
                  "_index": "persons",
                  "_id": "5392465",
                  "_score": 1,
                  "_source": {
                    "person_id": 5392465,
                    "company_name": "",
                    "company_id": "0"
                  }
                }
              ]
            }
          },
          "persons_count": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 698477,
            "buckets": [
              {
                "key": 3,
                "doc_count": 1
              }
            ]
          }
        },
        {
          "key": "ABC",
          "doc_count": 23946,
          "persons": {
            "hits": {
              "total": {
                "value": 23946,
                "relation": "eq"
              },
              "max_score": 1,
              "hits": [
                {
                  "_index": "persons",
                  "_id": "5392672",
                  "_score": 1,
                  "_source": {
                    "person_id": 5392672,
                    "company_name": "ABC",
                    "company_id": "1122"
                  }
                }
              ]
            }
          },
          "persons_count": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 23945,
            "buckets": [
              {
                "key": 12,
                "doc_count": 1
              }
            ]
          }
        }
      ]
    },
    "total_persons_count": {
      "value": 2
    }

Hope you understand the problem. Let me know if you need any other information.