Constrain Terms Aggregation by Doc Count Range

jamesaharvey · March 16, 2017, 3:17pm

I'm looking for a reliable and complete way to get terms aggregation buckets whose doc_count falls within a specific range. I'm currently using min_doc_count to constrain one end of the range, but that only covers the lower bound, which is limiting for my use case.

I've already tried nesting a bucket selector aggregation to constrain the max count of the aggregation. However, because it is nested, it only filters the buckets that the parent terms aggregation has already returned, so my results vary based on the value of the "size" param for my query.

Here's an example of my query against an index containing email addresses with activity "rows" that I am counting:

{
  "query": {
    "match_all": {}
  },
  "size": 0,
  "aggs": {
    "RANGE": {
      "terms": {
        "field": "email",
        "size": 1000,
        "min_doc_count": 20
      },
      "aggs": {
        "sum-for-bucket-selector": {
          "value_count": {
            "field": "email"
          }
        },
        "max-doc-count": {
          "bucket_selector": {
            "buckets_path": {
              "count": "sum-for-bucket-selector"
            },
            "script": {
              "inline": "params.count < 40"
            }
          }
        }
      }
    }
  }
}
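The dependence on "size" described above can be reproduced with a toy model in plain Python. This is only a sketch under the assumption that the terms aggregation truncates to its top "size" buckets before the bucket selector runs; the function name and sample data are hypothetical and nothing here talks to Elasticsearch.

```python
from collections import Counter


def terms_then_selector(emails, size, min_doc_count, max_count):
    """Toy model: truncate to the `size` largest buckets, then apply the selector."""
    counts = Counter(emails)
    # Terms aggregation step: drop small buckets, order by count descending, truncate.
    buckets = [(key, n) for key, n in counts.items() if n >= min_doc_count]
    buckets.sort(key=lambda bucket: (-bucket[1], bucket[0]))
    top = buckets[:size]
    # Bucket selector step: it only ever sees the truncated list.
    return [(key, n) for key, n in top if n < max_count]


# Hypothetical activity rows, one entry per document.
emails = ["a@x.com"] * 50 + ["b@x.com"] * 45 + ["c@x.com"] * 30 + ["d@x.com"] * 25

small = terms_then_selector(emails, size=2, min_doc_count=20, max_count=40)
large = terms_then_selector(emails, size=10, min_doc_count=20, max_count=40)
```

With size=2 the two largest buckets fill the list and are then both rejected by the selector, so the emails that really are in range never appear. Raising size only helps until there are more high-count emails than the new size.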

Further, I would also like to be able to page the result set. I've attempted this with partitioning, but that would require me to know ahead of time the total number of buckets my terms aggregation with the nested bucket selector will produce. Otherwise I get sparsely populated partitions rather than true paged results.

Here's an example of that query:

{
  "from": 0,
  "size": 0,
  "aggs": {
    "RANGE": {
      "terms": {
        "field": "email",
        "include": {
          "partition": 0,
          "num_partitions": 10
        },
        "size": 500,
        "min_doc_count": 1000
      }
    }
  }
}
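For illustration, here is a rough sketch of how the per-partition request bodies might be generated from an estimate of the number of unique emails. The helper name and the numbers are hypothetical; the estimate would have to come from somewhere else (a cardinality aggregation is one option), and it counts all unique terms rather than only those inside the doc count range, which is why the partitions come back sparse.

```python
import math


def partition_requests(field, estimated_terms, per_request, min_doc_count):
    """Yield one terms-aggregation request body per partition."""
    # Assumption: terms hash roughly evenly across partitions.
    num_partitions = max(1, math.ceil(estimated_terms / per_request))
    for partition in range(num_partitions):
        yield {
            "size": 0,
            "aggs": {
                "RANGE": {
                    "terms": {
                        "field": field,
                        "include": {
                            "partition": partition,
                            "num_partitions": num_partitions,
                        },
                        "size": per_request,
                        "min_doc_count": min_doc_count,
                    }
                }
            },
        }


# Hypothetical numbers: about 5000 unique emails, 500 terms per request.
bodies = list(partition_requests("email", 5000, 500, 1000))
```

Each body would be sent as its own search request. Because terms are assigned to partitions by hash, the number of in-range buckets per partition is uneven, so this gives a full scan in chunks rather than fixed-size pages.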

Any recommendations on other options for achieving this? Thanks in advance!

nik9000 · March 16, 2017, 7:22pm

Neither thing is possible to do efficiently with the documents scattered across all the shards. So Elasticsearch doesn't have a thing for it.

Now you could probably do some fancy work with a scripted_metric aggregation if you made sure to route all the documents for each email to the same shard. You could certainly do the doc range much more efficiently. Paging you'd have to hand implement, but if you did it based on range and name I expect you could do it.
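To make the idea above concrete, here is a toy model in plain Python. It is not scripted_metric code and does not talk to Elasticsearch: shard_for is a hypothetical stand-in for routing on the email, and the paging is a hand-rolled cursor over (count, email) pairs. The point it tries to show is that once every document for an email lives on one shard, each shard's local count is exact, so the range filter can be applied per shard before anything is merged.

```python
import zlib
from collections import Counter


def shard_for(email, num_shards):
    # Stand-in for custom routing: the same email always maps to the same shard.
    return zlib.crc32(email.encode("utf-8")) % num_shards


def counts_in_range(docs, num_shards, lo, hi):
    """Count per shard, filter per shard, then merge sorted by (count, email)."""
    shards = [Counter() for _ in range(num_shards)]
    for email in docs:
        shards[shard_for(email, num_shards)][email] += 1
    merged = []
    for shard in shards:
        # Local counts are exact, so lo <= count < hi can be applied here.
        merged.extend((n, email) for email, n in shard.items() if lo <= n < hi)
    return sorted(merged)


def page(rows, page_size, after=None):
    # Cursor paging: `after` is the last (count, email) pair of the previous page.
    remaining = [row for row in rows if after is None or row > after]
    return remaining[:page_size]


# Hypothetical data: one list entry per document.
docs = ["a@x.com"] * 5 + ["b@x.com"] * 3 + ["c@x.com"] * 3 + ["d@x.com"]
rows = counts_in_range(docs, num_shards=3, lo=2, hi=5)
first = page(rows, page_size=1)
second = page(rows, page_size=1, after=first[-1])
```

Sorting on the (count, email) pair gives every bucket a unique position, so the next page can always be requested as "everything after the last pair I saw" without knowing the total up front.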

None of those things are easy, I'm sorry to say.

