"size" Restrictions on non-terms Bucket Aggregations

I have a query which gets various cpu/memory/disk stats about my servers. Some of the stats, like network and swap io, must be calculated with a derivative aggregation at relatively high granularity. But I don't care about the individual values; I'm looking for max/median/average (extended_stats) on the derivatives.

Unfortunately, I can't find a way to avoid returning the individual values of these inner buckets. That is a problem because I have enough data that returning it and processing it on the client is infeasible (loading the JSON into python uses more than the 16GB of memory I have).

What I want is to simply return the aggregate calculations but not the actual buckets. However, the size parameter does not appear to work on the inner aggregations (the ones that aren't terms).

Is there a workaround? Are there plans to add "size" as a parameter to non-terms aggregations? Is there some meta-option to just say "I don't care about the values in the buckets, just calculate on them"?

Perhaps an example query would be useful:

{
  "size": 0,
  "query": {
    "filtered": {
      "filter": {
        "bool": {
          "must": [
            {
              "range": {
                "@timestamp": {
                  "gte": 1464640503000,
                  "lte": 1466454903000,
                  "format": "epoch_millis"
                }
              }
            }
          ],
          "must_not":
        }
      }
    }
  },
  "aggs": {
    "server_values": {
      "terms": {
        "field": "server",
        "exclude": "zt-.*"
      },
      "aggs": {
        "swap_in": {
          "date_histogram": {
            "field": "@timestamp",
            "interval": "6h",
            "time_zone": "America/Chicago",
            "min_doc_count": 0
          },
          "aggs": {
            "swap_stats": {
              "date_histogram": {
                "field": "@timestamp",
                "interval": "5m",
                "time_zone": "America/Chicago",
                "min_doc_count": 0
              },
              "aggs": {
                "swap_avg": {
                  "avg": {
                    "field": "mem.swap.sin"
                  }
                },
                "swap_delta": {
                  "derivative": {
                    "buckets_path": "swap_avg",
                    "size": 0
                  }
                }
              }
            },
            "value": {
              "extended_stats_bucket": {
                "buckets_path": "swap_stats>swap_delta"
              }
            }
          }
        }
      }
    }
  }
}

I do want the "value" field and its date_histogram buckets, but I don't care about any of the internal buckets for swap_avg or swap_delta. Those are simply being used to generate the differential data. And because they are higher resolution, the data they return dominates the query.

My current workaround is essentially using awk to filter those lines out. It's ugly and kludgy, and it means the text still has to travel over the network, so simply being able to say "Don't show me that" would be much nicer.

Is there a way?

You can filter that out server-side by using filter_path response filtering.

So in your case, you can probably do something like ?filter_path=aggregations.**.value, which will only show you the buckets named "value" under the "aggregations" element, regardless of how deep it is. You can also path to it directly, but the ** tends to make life more convenient.

Note: this still generates all the buckets on the server; it just performs the filter (aka awk'ing) on the response. So you still pay the price for generating all the buckets, but you save on networking. That sounds like exactly what you wanted, but I wanted to make the note just in case.
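
Since the client side here is python, a minimal sketch of building the request URL with filter_path might look like the following. The host (http://localhost:9200) and index pattern (metrics-*) are placeholders I made up, and build_search_url / run_search are hypothetical helper names, not part of any Elasticsearch client. It uses only the standard library.

```python
import json
import urllib.request
from urllib.parse import urlencode


def build_search_url(base_url, index, filter_paths):
    """Return a _search URL whose response is trimmed by filter_path."""
    # filter_path accepts a comma-separated list of patterns.
    # safe="*.," keeps the wildcards and commas readable in the URL.
    query = urlencode({"filter_path": ",".join(filter_paths)}, safe="*.,")
    return f"{base_url.rstrip('/')}/{index}/_search?{query}"


def run_search(url, body):
    """POST the query body and return the decoded, already filtered response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Placeholder host and index; replace with your own.
BASE_URL = "http://localhost:9200"
INDEX = "metrics-*"

# The wildcard form suggested above.
wildcard_url = build_search_url(BASE_URL, INDEX, ["aggregations.**.value"])

# A direct path to the "value" aggregation in the example query,
# plus the keys needed to tell servers and 6h buckets apart.
outer = "aggregations.server_values.buckets"
direct_url = build_search_url(
    BASE_URL,
    INDEX,
    [
        f"{outer}.key",
        f"{outer}.swap_in.buckets.key_as_string",
        f"{outer}.swap_in.buckets.value",
    ],
)

print(wildcard_url)
print(direct_url)
```

To actually send the query you would call run_search(direct_url, query_body) with the JSON body from the question loaded as a dict. One caveat, as far as I know: avg and derivative results also carry a field called "value", so the ** form may still match the inner 5m buckets. If that happens, the direct path is the narrower option.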