ES filter composite aggregation result by doc_count

I have the following ES query:

GET my-index/_search
{
  "size": 0,
  "aggs": {
    "my_bucket": {
      "composite": {
        "size": 10000,
        "sources": [
          {
            "stk1": {
              "terms": {
                "field": "calling_number.keyword"
              }
            }
          },
          {
            "stk2": {
              "terms": {
                "field": "called_number.keyword"
              }
            }
          }
        ]
      }
    }
  }
}

Here is the output:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 4,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "my_bucket" : {
      "after_key" : {
        "stk1" : "9",
        "stk2" : "99"
      },
      "buckets" : [
        {
          "key" : {
            "stk1" : "8",
            "stk2" : "99"
          },
          "doc_count" : 1
        },
        {
          "key" : {
            "stk1" : "9",
            "stk2" : "98"
          },
          "doc_count" : 1
        },
        {
          "key" : {
            "stk1" : "9",
            "stk2" : "99"
          },
          "doc_count" : 2
        }
      ]
    }
  }
}

I want to filter the buckets by doc_count >= 3, basically to exclude everything that is not >= 3. The result from this query will be used in Watcher to monitor activities which exceedes some threshold, in my case 3

Hi @mirokrastev ,

Thanks for your question. I don't believe there is a way to filter composite aggregations by the doc_count. Here's a reference issue discussing this:

Terms aggregations does support this using the min_doc_count: Terms aggregation | Elasticsearch Guide [8.3] | Elastic

Would a terms aggregation work for your use case?

1 Like

Hello @Jonathan_Buttner,

Thanks for the fast answer. Here is the idea that I want to accomplish:

I want to get the calling_numbers which calls another called_number more than X times (in my case 3 times) in the span of Y hours (in my case 1 hour).
Terms aggregation will not work, at least not with the approach that I have taken, because one calling_number, for example 99, could have called called_number "98" 2 times, and called_number "100" 2 times, and it will not exceed the threshold of 3 to the same called_number.

Do you have any idea how I may accomplish this task? The result will be used inside a watcher.

I'm not super familiar with watcher and sorry if I'm misunderstanding how the data is formatted but here's an example of how I think a nested terms aggregation could accomplish this:

Example
POST my-index-000001/_doc/
{
  "calling_number": "1",
  "called_number": "50"
}

POST my-index-000001/_doc/
{
  "calling_number": "2",
  "called_number": "60"
}

POST my-index-000001/_doc/
{
  "calling_number": "1",
  "called_number": "51"
}

POST my-index-000001/_doc/
{
  "calling_number": "1",
  "called_number": "50"
}

POST _search
{
  "size": 0, 
  "aggs": {
    "calling": {
      "terms": {
        "field": "calling_number.keyword"
      },
      "aggs": {
        "called": {
          "terms": {
            "field": "called_number.keyword",
            "min_doc_count": 2
          }
        }
      }
    }
  }
}
Results
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 4,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "calling": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "1",
          "doc_count": 3,
          "called": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "50",
                "doc_count": 2
              }
            ]
          }
        },
        {
          "key": "2",
          "doc_count": 1,
          "called": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": []
          }
        }
      ]
    }
  }
}

If you remove the "min_doc_count": 2 the results would look like:

No minimum
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 4,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "calling": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "1",
          "doc_count": 3,
          "called": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "50",
                "doc_count": 2
              },
              {
                "key": "51",
                "doc_count": 1
              }
            ]
          }
        },
        {
          "key": "2",
          "doc_count": 1,
          "called": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "60",
                "doc_count": 1
              }
            ]
          }
        }
      ]
    }
  }
}

So it'll only return the called buckets if they meet the minimum required count.

The in the span of Y hours will be more difficult to achieve, I would look into the pipeline aggregations, maybe a sliding window would work. If not, it'd probably need to be done by the server that is doing the query for the data.

1 Like

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.