Heya @YvorL, sorry again for the runaround between here and GitHub.
I just left a reply because a colleague of mine spotted a detail: the terms agg is sorted by cardinality, which means we have to execute in depth-first rather than breadth-first mode. That unfortunately makes the agg very expensive and moves it into the "abusive" category, even though it seems very innocent from the outside perspective.
For posterity, here's what I mentioned in https://github.com/elastic/elasticsearch/issues/55240:
So as it turns out, jimczi noticed a detail that I missed: the terms agg is sorted by the cardinality agg:
Which actually puts the query into the "abusive" category, even though it looks pretty innocent. What happens is that sorting by sub-agg means we have to execute the terms aggregation in depth-first mode.
Normally, terms aggs execute breadth_first, meaning they collect the list of terms and their counts, find the top-n, and prune away all the rest of the terms. They then execute the next layer of aggregations on only those top-n buckets. We can do this pruning because the top-n is determined by count.
But when you sort by a sub-aggregation, there is no way to know which buckets to prune, because the ordering depends on a not-yet-calculated quantity. This means we have to switch to depth_first, where we fully process the entire aggregation tree before we can do any pruning of results.
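To make the difference concrete, here is a toy sketch in Python (not Elasticsearch's actual implementation). It assumes documents are simple `(term, value)` pairs and uses a plain `set` as a stand-in for the per-bucket cardinality sketch; the function names are made up for illustration. The thing to watch is how many per-bucket states each strategy has to build.

```python
from collections import Counter, defaultdict


def breadth_first(docs, size):
    """Count terms first, prune to the top-n by doc count, then run the sub-agg."""
    counts = Counter(term for term, _ in docs)
    top = [term for term, _ in counts.most_common(size)]
    # Second pass: only the surviving buckets get a (costly) sub-agg state.
    states = {term: set() for term in top}
    for term, value in docs:
        if term in states:
            states[term].add(value)
    results = {term: len(values) for term, values in states.items()}
    return results, len(states)


def depth_first_sorted_by_subagg(docs, size):
    """Sorting by the sub-agg: every bucket needs its state before pruning."""
    states = defaultdict(set)
    for term, value in docs:
        states[term].add(value)
    built = len(states)  # one state per distinct term, nothing pruned yet
    ranked = sorted(states, key=lambda t: len(states[t]), reverse=True)[:size]
    results = {term: len(states[term]) for term in ranked}
    return results, built


# Synthetic data: 1000 distinct terms, 10 docs each (hypothetical values).
docs = [(f"host-{i % 1000}", i % 7) for i in range(10_000)]
_, bf_states = breadth_first(docs, size=5)
_, df_states = depth_first_sorted_by_subagg(docs, size=5)
print(bf_states, df_states)
```

In the breadth-first variant the number of states is bounded by `size`; in the sorted-by-sub-agg variant it grows with the number of distinct terms, which is the cost described above.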
In the above query, this means we actually collect one bucket per distinct instance.keyword value (for each 30s interval), each with a corresponding HLL sketch for the cardinality. This can get very expensive very quickly if instance.keyword has even a moderate cardinality, and will quickly trip the circuit breaker as you're seeing.
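A rough back-of-envelope estimate shows how fast this adds up. The sketch below assumes the figure from the cardinality aggregation docs, as I remember it, of roughly `precision_threshold * 8` bytes per sketch, with a default `precision_threshold` of 3000. It treats every (interval, term) pair as a fully populated bucket, so read the result as a ceiling rather than a measurement. The function name and the example numbers are hypothetical.

```python
def estimate_sketch_memory_bytes(num_intervals, distinct_terms, precision_threshold=3000):
    """Upper-bound estimate of memory held by cardinality sketches.

    Assumes about precision_threshold * 8 bytes per sketch and one
    sketch for every (interval, term) combination.
    """
    bytes_per_sketch = precision_threshold * 8
    return num_intervals * distinct_terms * bytes_per_sketch


# Hypothetical workload: one hour of 30s intervals, 5000 distinct instances.
intervals = 60 * 60 // 30
estimate = estimate_sketch_memory_bytes(intervals, 5000)
print(f"{estimate / 1e9:.1f} GB")
```

Passing a smaller `precision_threshold` into the same function shows how much headroom a lower precision would buy for the same bucket count.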
So I think this really is a case of "expensive query" tripping the breaker, even though it looks very innocent to a user.
I'll chat with the Kibana folks and see if there is some way to proactively warn users about sorting by sub-agg, particularly by cardinality, which is relatively expensive (~kilobytes per bucket, rather than ~bytes). Or perhaps Kibana can default to a lower precision when it sees the user is sorting by cardinality. We might also want to trip the agg circuit breaker faster than 70-80% to help make this less impactful on nodes.
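For anyone who wants to try the lower-precision idea by hand in the meantime, here is a hypothetical request body built as a Python dict. This is not the original query from the issue: the `@timestamp` and `some_field` field names, the agg names, and the `fixed_interval` syntax (which depends on your version) are all assumptions. The only real knob being illustrated is `precision_threshold` on the cardinality agg.

```python
import json


def build_body(precision_threshold=100, size=5):
    """Terms agg sorted by a cardinality sub-agg, with reduced precision.

    Field names other than instance.keyword are placeholders.
    """
    return {
        "size": 0,
        "aggs": {
            "per_interval": {
                "date_histogram": {"field": "@timestamp", "fixed_interval": "30s"},
                "aggs": {
                    "instances": {
                        "terms": {
                            "field": "instance.keyword",
                            "size": size,
                            # Ordering by the sub-agg is what forces depth-first.
                            "order": {"unique_values": "desc"},
                        },
                        "aggs": {
                            "unique_values": {
                                "cardinality": {
                                    "field": "some_field",  # placeholder field
                                    "precision_threshold": precision_threshold,
                                }
                            }
                        },
                    }
                },
            }
        },
    }


print(json.dumps(build_body(), indent=2))
```

Lowering the precision trades accuracy for memory; it does not change the fact that every bucket still needs a sketch before the terms can be pruned.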
Apologies for missing this detail earlier!