Constrain Terms Aggregation by Doc Count Range

I'm looking for a reliable and complete way to get Terms Aggregation doc_counts within a specific range. I'm currently using min_doc_count to constrain one end of the range, but on its own that is too limiting for my use case.

I've already tried a nested bucket_selector aggregation to cap the maximum count. However, because it is nested, it only filters the buckets that the parent terms aggregation has already returned, so my results vary with the value of the "size" param on the terms aggregation.

Here's an example of my query against an index containing email addresses with activity "rows" that I am counting:

{
  "query": {
    "match_all": {}
  },
  "size": 0,
  "aggs": {
    "RANGE": {
      "terms": {
        "field": "email",
        "size": 1000,
        "min_doc_count": 20
      },
      "aggs": {
        "sum-for-bucket-selector": {
          "value_count": {
            "field": "email"
          }
        },
        "max-doc-count": {
          "bucket_selector": {
            "buckets_path": {
              "count": "sum-for-bucket-selector"
            },
            "script": {
              "inline": "params.count < 40"
            }
          }
        }
      }
    }
  }
}
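To make the "size" dependence concrete, here is a minimal Python sketch that mimics the order of operations on a single shard: the terms aggregation keeps the top "size" buckets by doc_count, and only then does the bucket_selector filter them. The email addresses and counts are made up for illustration, and the function `terms_then_selector` is hypothetical, not an Elasticsearch API. It also ignores the cross-shard approximation that a real terms aggregation adds on top.

```python
from collections import Counter

def terms_then_selector(doc_counts, size, min_doc_count, max_exclusive):
    """Mimic a terms aggregation followed by a nested bucket_selector."""
    # Terms aggregation: drop buckets below min_doc_count, order by doc_count
    # descending (term ascending as a tie-break), then truncate to `size`.
    eligible = [(t, c) for t, c in doc_counts.items() if c >= min_doc_count]
    eligible.sort(key=lambda tc: (-tc[1], tc[0]))
    top = eligible[:size]
    # bucket_selector: only sees the buckets that survived the truncation.
    return [(t, c) for t, c in top if c < max_exclusive]

# Hypothetical data: 5 heavy senders (100 docs each) and 5 senders with 25..29 docs.
counts = Counter({f"heavy{i}@example.com": 100 for i in range(5)})
counts.update({f"mid{i}@example.com": 25 + i for i in range(5)})

small = terms_then_selector(counts, size=5, min_doc_count=20, max_exclusive=40)
large = terms_then_selector(counts, size=1000, min_doc_count=20, max_exclusive=40)

print(len(small), len(large))  # prints "0 5"
```

With size=5 the five heaviest buckets fill the terms result and are then all rejected by the selector, so nothing comes back even though five emails fall inside the 20..39 range.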

I would also like to be able to page the result set. I've tried partitioning, but that requires knowing ahead of time how many buckets the Terms Aggregation with the nested bucket selector will return. Otherwise I get sparsely populated partitions rather than true paged results.

Here's an example of that query:

{
  "from": 0,
  "size": 0,
  "aggs": {
    "RANGE": {
      "terms": {
        "field": "email",
        "include": {
          "partition": 0,
          "num_partitions": 10
        },
        "size": 500,
        "min_doc_count": 1000
      }
    }
  }
}
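The sparse-partition effect can be reproduced with a small Python sketch. It assumes that terms are assigned to partitions by a hash of the term and that min_doc_count is applied within the chosen partition. CRC32 is used here only as a stand-in hash (Elasticsearch uses its own hash internally), and the data and the helper names `partition_of` and `partition_page` are hypothetical.

```python
import zlib

def partition_of(term, num_partitions):
    # Stand-in hash for illustration only; not the hash Elasticsearch uses.
    return zlib.crc32(term.encode("utf-8")) % num_partitions

def partition_page(doc_counts, partition, num_partitions, size, min_doc_count):
    """Mimic one terms request restricted to a single partition."""
    # Only terms hashing to this partition are considered at all.
    in_partition = [(t, c) for t, c in doc_counts.items()
                    if partition_of(t, num_partitions) == partition]
    # min_doc_count then thins the partition out further.
    kept = [(t, c) for t, c in in_partition if c >= min_doc_count]
    kept.sort(key=lambda tc: (-tc[1], tc[0]))
    return kept[:size]

# Hypothetical data: 1000 emails, of which only every 50th has >= 1000 docs.
counts = {f"user{i}@example.com": (2000 if i % 50 == 0 else 3)
          for i in range(1000)}

pages = [partition_page(counts, p, num_partitions=10, size=500, min_doc_count=1000)
         for p in range(10)]
sizes = [len(p) for p in pages]
print(sizes, sum(sizes))
```

Only 20 terms qualify in total, so each "page" holds just the handful that happen to hash into it, nowhere near the requested size of 500. Picking num_partitions so that pages come out full would require knowing the number of qualifying terms in advance.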

Any recommendations on other options for achieving this? Thanks in advance!

Neither of those things can be done efficiently while the documents for each email are scattered across all the shards, so Elasticsearch doesn't have a built-in feature for it.

Now you could probably do some fancy work with a scripted_metric aggregation if you made sure to route all the documents for each email to the same shard. You could certainly do the doc count range much more efficiently that way. Paging you'd have to implement by hand, but if you did it based on range and name I expect you could do it.
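As a rough illustration of that idea, here is a Python simulation, not Painless and not an actual scripted_metric definition. It assumes every document for an email is routed to one shard, so each shard's count for an email is exact and the range filter can run locally. Paging is sketched as a hand-rolled cursor over (count, email) pairs. The shard count, the CRC32 routing, the data, and all function names are hypothetical.

```python
import zlib
from collections import Counter

NUM_SHARDS = 3  # hypothetical shard count

def shard_for(email):
    # Stand-in for custom routing on the email value.
    return zlib.crc32(email.encode("utf-8")) % NUM_SHARDS

def index_docs(emails):
    """Count docs per email on each shard; routing keeps each email on one shard."""
    shards = [Counter() for _ in range(NUM_SHARDS)]
    for email in emails:
        shards[shard_for(email)][email] += 1
    return shards

def shard_page(shard_counts, lo, hi, page_size, after):
    # Per-shard work: counts are complete here, so the range filter is exact.
    hits = sorted((c, e) for e, c in shard_counts.items() if lo <= c < hi)
    if after is not None:
        hits = [h for h in hits if h > after]  # resume past the cursor
    return hits[:page_size]

def page(shards, lo, hi, page_size, after=None):
    # Coordinator work: merge the per-shard pages and keep the first page_size.
    merged = sorted(h for s in shards for h in shard_page(s, lo, hi, page_size, after))
    return merged[:page_size]

# Hypothetical data: user00 has 18 docs, user01 has 20, ..., user11 has 40.
docs = []
for i in range(12):
    docs += [f"user{i:02d}@example.com"] * (18 + 2 * i)

shards = index_docs(docs)
first = page(shards, lo=20, hi=40, page_size=4)
second = page(shards, lo=20, hi=40, page_size=4, after=first[-1])
print(first)
print(second)
```

Because the cursor is the last (count, email) pair of the previous page, every page is full until the results run out, and no total bucket count is needed up front. In a real scripted_metric, the per-shard step would presumably live in the map and combine scripts and the merge in the reduce script, but that mapping is an assumption here.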

None of those things are easy, I'm sorry to say.
