How can I limit bucket to n results for averaging?


(Samthurston) #1
In the guide I saw an example: 

GET /cars/transactions/_search?search_type=count
{
   "query":{
      "match": {
         "make": "ford"
      }
   },
   "aggs":{
      "recent_sales": {
         "filter": { 
            "range": {
               "sold": {
                  "from": "now-1M"
               }
            }
         },
         "aggs": {
            "average_price":{
               "avg": {
                  "field": "price" 
               }
            }
         }
      }
   }
}

This is pretty close to what I want to do, but instead of a range of dates, I just want the 5 most recent results in each bucket, and I want to be able to average the price.

I have tried the limit filter in a bucket, but that doesn't do what I would expect it to do (documents per shard? really?) and it's deprecated anyway.

I also tried to use the top_hits aggregation but that won't allow me to average the results because it doesn't accept sub-aggregations.

I have tried to apply multiple filters, set size options, and all other sorts of things with no correct results. the closest I seem to be able to find to this problem is

Top N documents with multiple buckets

But I don't actually understand what he's doing and I don't think the terms aggregator applies since i'm trying to sort and limit by timestamps of which there are many many duplicates. I'm under some time pressure here and it seems like if I was doing this with a traditional SQL query I'd have been done yesterday. Any help is much appreciated.


(Pieter Agenbag) #2

Hi.

I dont think I understand what You are trying to do :wink:

You want an avg aggregation , but avg of which buckets ,because you siad you Dont want a date range ?
So avg of each unique timestamp ?

I dont have an answer for you , but maybe have a look at the aggregation order , you can order a bucket by a field or metric other than the bucket itself.
? Get the avg , but order by timestamp ?


(Samthurston) #3

I think the best way to put it is, I want the average price of the last 5 sales in each bucket. some of those sales could be years old so I can't define a time range to guarantee 5 sales.

I don't think the terms agg will work because neither my buckets nor my prices or timestamps are "terms" in that sense. the buckets are geographic regions, and the geo_bounds agg doesn't allow you to set per-bucket limits as far as i can tell. my problem isn't so much getting them ordered correctly by date, as limiting the count that avg considers to only 5 per bucket


(Pieter Agenbag) #4

Ah , I see.
I dont have a solution for that.. but will give it some thought.
Sounds like something you need ES 2.0's pipeline aggregations for.

If you Do find a solution, please post it here - sounds like a really useful thing to be able to do


(Mark Harwood) #5

In 2.0 (currently in beta) you can use the sampler agg to limit any child aggs (e.g. your avg) to the top N scoring docs:

https://www.elastic.co/guide/en/elasticsearch/reference/2.0/search-aggregations-bucket-sampler-aggregation.html


(Samthurston) #6

That actually works pretty well, provided I want more docs than I have shards, which isn't necessarily the case. But this will have to do. Thanks.


(system) #7