Aggregation size 0 for top results


(Sylvain) #1

Hi everyone,

I'm migrating my application from Elasticsearch 1.7 to 5.6, but I'm stuck on the following aggregation, which previously relied on "size": 0 (removed in 5.x).

GET /stats/event/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "date": {
              "gte": "2017-01-02T02:00:00+01:00",
              "lte": "2018-06-21T23:59:59+02:00"
            }
          }
        },
        {
          "term": {
            "type": "booking"
          }
        }
      ]
    }
  },
  "aggregations": {
    "subgroups": {
      "terms": {
        "field": "eventDate",
        "size": 0
      },
      "aggregations": {
        "total": {
          "sum": {
            "field": "ca"
          }
        }
      }
    }
  }
}

So basically, I want to retrieve all "events" (which are, in fact, reservations for an event) in a date range, group them by date, and get the total of their ca field. Afterwards, in JavaScript, I sorted all the buckets by total.value and kept only the top 10.

In other words, I want the top 10 events by total ca.
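
For reference, the client-side step described above can be sketched roughly as follows. This is only an illustration: it assumes `response` is the parsed JSON body of the search above, with buckets under `aggregations.subgroups.buckets`, and the function name `topBucketsByTotal` is made up for this example.

```javascript
// Sketch only: `response` is assumed to be the parsed search response,
// with the terms buckets under aggregations.subgroups.buckets.
function topBucketsByTotal(response, limit = 10) {
  const buckets = response.aggregations.subgroups.buckets;
  // Copy before sorting so the original response is left untouched.
  return [...buckets]
    .sort((a, b) => b.total.value - a.total.value) // highest total first
    .slice(0, limit);
}
```

This only gives the right answer if every bucket is returned, which is why the old query depended on "size": 0.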

But now that "size": 0 has been removed, I have no idea how to achieve this. For example, if I use "size": 2147483647, it returns incorrect values because some reservations are ignored (yes, I have a huge database :slight_smile: ).

Maybe there is a way to ask Elasticsearch to give me only the top 10 results... I was looking at the top_hits aggregation, but I can't tell whether it applies to my problem.


#2

I think what you're looking for is the date histogram aggregation?

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-datehistogram-aggregation.html

You can adjust the interval as you wish...

GET /stats/event/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "date": {
              "gte": "2017-01-02T02:00:00+01:00",
              "lte": "2018-06-21T23:59:59+02:00"
            }
          }
        },
        {
          "term": {
            "type": "booking"
          }
        }
      ]
    }
  },
  "aggregations": {
    "subgroups": {
      "date_histogram": {
        "field": "date",
        "interval": "day"
      },
      "aggregations": {
        "total": {
          "sum": {
            "field": "ca"
          }
        }
      }
    }
  }
}

(Sylvain) #3

Thanks for the hint @klof. This may answer part of my problem, but what if I want to group by eventId or eventTheme instead of date?


#4

To group by eventId or eventTheme, you could add terms aggregations alongside the date histogram.
Something like:

GET /stats/event/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "date": {
              "gte": "2017-01-02T02:00:00+01:00",
              "lte": "2018-06-21T23:59:59+02:00"
            }
          }
        },
        {
          "term": {
            "type": "booking"
          }
        }
      ]
    }
  },
  "aggs": {
    "by_date": {
      "date_histogram": {
        "field": "date",
        "interval": "day"
      },
      "aggs": {
        "total": {
          "sum": {
            "field": "ca"
          }
        }
      }
    },
    "by_event_id": {
      "terms": {
        "field": "eventId"
      },
      "aggs": {
        "total": {
          "sum": {
            "field": "ca"
          }
        }
      }
    },
    "by_event_theme": {
      "terms": {
        "field": "eventTheme"
      },
      "aggs": {
        "total": {
          "sum": {
            "field": "ca"
          }
        }
      }
    }
  }
}

(Sylvain) #5

Well, I think we're back to the initial problem here. For example, by_event_id only returns 10 buckets (and none of them is in the top 10). If I add "size": 2147483647 to the terms aggregation, I get more results, but not all of them...


#6

Yes, by default ES returns 10 buckets per terms aggregation, and you can indeed raise that with the size parameter. But this supposes you already know how many values to expect.
Only from ES 6 onward can you page through all the buckets, using the composite aggregation:

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-composite-aggregation.html
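
As a rough sketch of how that paging could look from JavaScript: the helper below builds the request body for each page, passing the last bucket's key as `after`. The function name `compositePageBody` and the aggregation names `groups` and `total` are made up for this example, the `query` clause is left out for brevity, and this is based on my reading of the composite aggregation docs rather than on a tested ES 6 setup.

```javascript
// Sketch only: builds the body for the next page of a composite aggregation.
// "groups" and "total" are illustrative aggregation names, not required ones.
function compositePageBody(field, pageSize, previousResponse) {
  const composite = {
    size: pageSize,
    // One source named after the field, so bucket keys look like { eventId: 246 }.
    sources: [{ [field]: { terms: { field } } }],
  };
  if (previousResponse) {
    const buckets = previousResponse.aggregations.groups.buckets;
    if (buckets.length === 0) return null; // previous page was empty: nothing left
    // Resume right after the last bucket of the previous page.
    composite.after = buckets[buckets.length - 1].key;
  }
  return {
    size: 0, // assumption: the hits themselves are not needed
    aggs: { groups: { composite, aggs: { total: { sum: { field: "ca" } } } } },
  };
}
```

You would call it with no previous response for the first page, send the body to _search, and keep calling it with each response until it returns null.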

But I don't understand why you aggregate on a query that includes the date range. Don't you want your aggregations computed over all the data?
Maybe you want to aggregate over all events of type "booking", and then filter the returned hits by the date range?
In that case, move the date range clause into a post_filter:

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-post-filter.html


(Sylvain) #7

Honestly, I'm not an Elasticsearch expert... This used to work, which is why I wrote the query this way, but it may not be the best approach!

I tried your suggestion with post_filter (which I didn't know existed), but I'm not sure how to use it:

GET /stats/event/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "date": {
              "gte": "2017-01-02T02:00:00+01:00",
              "lte": "2018-06-21T23:59:59+02:00"
            }
          }
        },
        {
          "term": {
            "type": "booking"
          }
        }
      ]
    }
  },
  "post_filter": {
    "terms": {
      "field": "eventId" 
    }, 
    "aggs": {
      "total": {
        "sum": {
          "field": "ca"
        }
      }
    }
  }
}

It responds with:

{
  "type": "parsing_exception",
  "reason": "[terms] query does not support [field]",
  ...
}

#8

No problem!
Order of steps:
1 - The query returns all events of type "booking"
2 - The aggregation groups the results of the query (1) by eventId and sums their ca
3 - The post_filter restricts the hits of the query (1) to the date range

GET /stats/event/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "type": "booking"
          }
        }
      ]
    }
  },
  "post_filter": {
    "range": {
      "date": {
        "gte": "2017-01-02T02:00:00+01:00",
        "lte": "2018-06-21T23:59:59+02:00"
      }
    }
  },
  "aggs": {
    "by_event_id": {
      "terms": {
        "field": "eventId"
      },
      "aggs": {
        "total": {
          "sum": {
            "field": "ca"
          }
        }
      }
    }
  }
}

(Sylvain) #9

Maybe I'm missing something but I can't see any significant difference between this query and the very first one I used: only 10 results are returned in the terms aggregation... and they may not be the top 10.


#10

Yes, unless you set the size parameter in the aggregation you will get 10 buckets (the ES default)...
What do you mean by the top 10 events? The ones (eventId) that have the most reservations (documents)?
Because the terms aggregation on eventId should give you the 10 eventId values with the most documents.

So can we build this query step by step, starting by removing the post_filter clause?
You should already get the 10 best eventId values in the "aggregations" object of your response.


(Sylvain) #11

FYI, an event document looks like this:

{
          "created_at": "2017-01-04T17:06:06.480+00:00",
          "updated_at": "2017-01-04T17:06:06.480+00:00",
          "type": "booking",
          "subType": "Atelier",
          "date": "2017-01-03",
          "stat": 1,
          "userId": 2396,
          "gender": "female",
          "age": 49,
          "group": "standard",
          "reservationId": 4560,
          "ca": 27,
          "name": "OPENLAB",
          "eventId": 246,
          "eventDate": "2017-01-14",
          "ageRange": "",
          "eventTheme": ""
        }

To be precise, what I need is the 10 best groups of events (grouped by eventId | date | eventTheme | name | subType | age, but only one of them at a time).

"Best" is determined by the sum of the ca fields in a group. For example, I need the 10 age groups whose ca sum is the highest among all age groups.

I hope that's clear enough... :slight_smile:


#12

Yes, perfectly clear!
You're right! So I tried with the top_hits aggregation, and it should meet the need.
Could you try this one and check the result?
You should get a list of 10 eventId values, sorted by "sum_CA", which is a sum aggregation over each event's ca.
The "top_events" aggregation lists only one reservation per eventId (the one with the highest ca), but you can increase its size to get more reservations for that eventId.

POST /stats/event/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "type": "booking"
          }
        }
      ]
    }
  },
  "aggs": {
    "top_events_id": {
      "terms": {
        "field": "eventId",
        "size": 10,
        "order": {
          "sum_CA": "desc"
        }
      },
      "aggs": {
        "top_events": {
          "top_hits": {
            "size": 1,
            "sort": [
              {
                "ca": "desc"
              }
            ]
          }
        },
        "sum_CA": {
          "sum": {
            "field": "ca"
          }
        }
      }
    }
  }
}

(Sylvain) #13

Wow! That seems to work perfectly in every case... :+1:
I don't know how to thank you, I would never have thought of this by myself.

Could I ask for a bit more detail on how the top_hits aggregation works in this case? I can't figure out how this query works...


#14

From what I understand, the first aggregation is your top-level grouping, since you want to group by eventId. The top_hits aggregation then controls which documents are returned inside each group; in your example you could also set "top_hits": {} if the sub-documents don't matter.
The interesting part is the nested sum aggregation on ca, computed per eventId, which you can reference in the order of the top-level terms aggregation to sort the eventId buckets.
That's just my understanding of this aggregation. I don't know if it's clear, but you should play with it and try the different options available.
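
Since you need the same thing for several fields (one at a time), here is a rough JavaScript sketch that builds the request body for any grouping field and reads the totals back. The names `buildTopGroupsBody`, `readTopGroups` and `top_groups` are made up for this example, and the top-level "size": 0 (to skip the hits) is my own assumption. The top_hits part is left out because, as far as I can tell, the ordering comes only from the sum_CA sub-aggregation.

```javascript
// Sketch only: request body for "top N groups by sum of ca" on any field.
function buildTopGroupsBody(groupField, size = 10) {
  return {
    size: 0, // assumption: only the aggregation result is needed, not the hits
    query: { bool: { must: [{ term: { type: "booking" } }] } },
    aggs: {
      top_groups: {
        // Order the buckets by the sub-aggregation named sum_CA, highest first.
        terms: { field: groupField, size, order: { sum_CA: "desc" } },
        aggs: { sum_CA: { sum: { field: "ca" } } },
      },
    },
  };
}

// Flattens the buckets of such a response into { key, total } pairs.
function readTopGroups(response) {
  return response.aggregations.top_groups.buckets.map((bucket) => ({
    key: bucket.key,
    total: bucket.sum_CA.value,
  }));
}
```

For example, buildTopGroupsBody("eventTheme") gives the body for the 10 best themes, and buildTopGroupsBody("age", 5) the body for the 5 best ages.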
Thank you for your use case, it's very interesting!

