Aggregation size 0 for top results

sylvainb · June 12, 2018, 10:31am

Hi everyone,

I'm migrating my application from Elasticsearch 1.7 to 5.6, but I'm stuck on the following aggregation, which previously relied on "size": 0 (removed in 5.x).

GET /stats/event/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "date": {
              "gte": "2017-01-02T02:00:00+01:00",
              "lte": "2018-06-21T23:59:59+02:00"
            }
          }
        },
        {
          "term": {
            "type": "booking"
          }
        }
      ]
    }
  },
  "aggregations": {
    "subgroups": {
      "terms": {
        "field": "eventDate",
        "size": 0
      },
      "aggregations": {
        "total": {
          "sum": {
            "field": "ca"
          }
        }
      }
    }
  }
}

So basically, I want to retrieve all "events" (which are, in fact, reservations for an event) in a date range, grouped by date, and get the total of their ca field. Afterwards, in JavaScript, I sort all buckets by total.value and keep only the top 10.

In other words, I want the 10 events with the highest total ca.
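
A minimal sketch of that client-side step, written in Python rather than the JavaScript mentioned above. The function name and the sample response are made up for illustration; only the bucket shape (a metric named total with a value) follows the request shown earlier.

```python
def top_buckets(response, agg_name="subgroups", metric="total", n=10):
    """Return the n buckets with the highest metric value, largest first."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return sorted(buckets, key=lambda b: b[metric]["value"], reverse=True)[:n]


# Hypothetical response fragment, shaped like a terms + sum result.
sample_response = {
    "aggregations": {
        "subgroups": {
            "buckets": [
                {"key_as_string": "2017-01-14", "doc_count": 3, "total": {"value": 81.0}},
                {"key_as_string": "2017-02-01", "doc_count": 9, "total": {"value": 45.0}},
                {"key_as_string": "2017-03-10", "doc_count": 1, "total": {"value": 120.0}},
            ]
        }
    }
}

top2 = top_buckets(sample_response, n=2)
print([b["key_as_string"] for b in top2])
```

This only gives the right answer if the response contains every bucket, which is why the old "size": 0 mattered here.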

But now that "size": 0 has been removed, I have no idea how to achieve this. For example, if I use "size": 2147483647, it returns incorrect values because some reservations are ignored (yes, I have a huge database).

Maybe there is a way to ask Elasticsearch to give me only the top 10 results... I was looking at the top_hits aggregation, but I can't tell whether it applies to my problem.

klof · June 12, 2018, 10:45am

I think what you're looking for is the date histogram aggregation?

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-datehistogram-aggregation.html

You can adjust the interval as you wish...

GET /stats/event/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "date": {
              "gte": "2017-01-02T02:00:00+01:00",
              "lte": "2018-06-21T23:59:59+02:00"
            }
          }
        },
        {
          "term": {
            "type": "booking"
          }
        }
      ]
    }
  },
  "aggregations": {
    "subgroups": {
      "date_histogram": {
        "field": "date",
        "interval": "day"
      },
      "aggregations": {
        "total": {
          "sum": {
            "field": "ca"
          }
        }
      }
    }
  }
}

sylvainb · June 12, 2018, 10:52am

Thanks for the hint, @klof. This may answer part of my problem, but what if I want to group by eventId or eventTheme instead of date?

klof · June 12, 2018, 11:11am

To group by eventId or eventTheme, you could add terms aggregations.
Something like:

GET /stats/event/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "date": {
              "gte": "2017-01-02T02:00:00+01:00",
              "lte": "2018-06-21T23:59:59+02:00"
            }
          }
        },
        {
          "term": {
            "type": "booking"
          }
        }
      ]
    }
  },
  "aggs": {
    "by_date": {
      "date_histogram": {
        "field": "date",
        "interval": "day"
      },
      "aggs": {
        "total": {
          "sum": {
            "field": "ca"
          }
        }
      }
    },
    "by_event_id": {
      "terms": {
        "field": "eventId"
      },
      "aggs": {
        "total": {
          "sum": {
            "field": "ca"
          }
        }
      }
    },
    "by_event_theme": {
      "terms": {
        "field": "eventTheme"
      },
      "aggs": {
        "total": {
          "sum": {
            "field": "ca"
          }
        }
      }
    }
  }
}

sylvainb · June 12, 2018, 12:18pm

Well, I think we're back to the initial problem here. For example, by_event_id only returns 10 buckets (and none of them is in the top 10). If I add "size": 2147483647 to the terms aggregation, I get more results, but not all of them...

klof · June 12, 2018, 1:49pm

Yes, by default ES returns 10 buckets per aggregation, and with the size parameter you can indeed increase the number of returned buckets. But this supposes you already know how many values to expect.
Only from ES 6 onwards can you page through all the buckets of an aggregation, using the composite aggregation:

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-composite-aggregation.html
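
As a rough sketch of what paging through a composite aggregation could look like (Python, ES 6 only, so not applicable to 5.6): the search argument is a hypothetical callable that sends a request body and returns the parsed response, and fake_search is a made-up stand-in used only to exercise the loop. The names groups, group and total are arbitrary.

```python
def iter_composite_buckets(search, field, page_size=1000):
    """Yield every bucket of a composite aggregation, one page at a time.

    `search` is any callable taking a request body (dict) and returning the
    parsed JSON response (dict); it stands in for a real client call.
    """
    after = None
    while True:
        composite = {
            "size": page_size,
            "sources": [{"group": {"terms": {"field": field}}}],
        }
        if after is not None:
            composite["after"] = after  # resume after the last key already seen
        body = {
            "size": 0,  # hits size, not bucket size: skip hits, keep aggregations
            "aggs": {
                "groups": {
                    "composite": composite,
                    "aggs": {"total": {"sum": {"field": "ca"}}},
                }
            },
        }
        result = search(body)["aggregations"]["groups"]
        buckets = result["buckets"]
        if not buckets:
            return
        yield from buckets
        # Use after_key when the response provides it, else the last bucket's key.
        after = result.get("after_key", buckets[-1]["key"])


def fake_search(body):
    """Stand-in for Elasticsearch: pages over three made-up eventId sums."""
    data = [(101, 50.0), (246, 27.0), (300, 12.0)]
    comp = body["aggs"]["groups"]["composite"]
    after = comp.get("after", {}).get("group")
    rows = [r for r in data if after is None or r[0] > after][: comp["size"]]
    buckets = [{"key": {"group": k}, "doc_count": 1, "total": {"value": v}} for k, v in rows]
    out = {"buckets": buckets}
    if buckets:
        out["after_key"] = buckets[-1]["key"]
    return {"aggregations": {"groups": out}}


all_buckets = list(iter_composite_buckets(fake_search, "eventId", page_size=2))
top = sorted(all_buckets, key=lambda b: b["total"]["value"], reverse=True)[:10]
```

With every bucket collected this way, the top 10 can be sorted client-side as before, at the cost of transferring all buckets.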

But I don't understand why you aggregate on a query that includes the date range. Do you want your aggregations computed over all the data?
Maybe you want to aggregate over all events of type "booking" and then filter the returned hits by the date range?
In that case, move the date range clause into a post_filter:

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-post-filter.html

sylvainb · June 12, 2018, 2:16pm

Honestly, I'm not an Elasticsearch expert... This used to work, which is why I wrote the query this way, but it may not be the best approach!

I tried your suggestion with post_filter (which I didn't know existed), but I'm not sure how to use it:

GET /stats/event/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "date": {
              "gte": "2017-01-02T02:00:00+01:00",
              "lte": "2018-06-21T23:59:59+02:00"
            }
          }
        },
        {
          "term": {
            "type": "booking"
          }
        }
      ]
    }
  },
  "post_filter": {
    "terms": {
      "field": "eventId" 
    }, 
    "aggs": {
      "total": {
        "sum": {
          "field": "ca"
        }
      }
    }
  }
}

It responds with:

{
  "type": "parsing_exception",
  "reason": "[terms] query does not support [field]",
  ...
}

klof · June 12, 2018, 2:29pm

No problem!
Order of steps:
1 - The query returns all events of type "booking"
2 - The aggregation groups the results of the query (1) by eventId and sums ca
3 - The post_filter restricts the hits of the query (1) to the date range; the aggregation in (2) is not affected

GET /stats/event/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "type": "booking"
          }
        }
      ]
    }
  },
  "post_filter": {
    "range": {
      "date": {
        "gte": "2017-01-02T02:00:00+01:00",
        "lte": "2018-06-21T23:59:59+02:00"
      }
    }
  },
  "aggs": {
    "by_event_id": {
      "terms": {
        "field": "eventId"
      },
      "aggs": {
        "total": {
          "sum": {
            "field": "ca"
          }
        }
      }
    }
  }
}
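
To make the order of steps concrete, here is a small Python simulation on made-up documents. It is not how Elasticsearch is implemented, only a sketch of the behaviour: aggregations are computed on everything the query matches, and post_filter narrows the hits afterwards.

```python
from collections import defaultdict

# Made-up documents using the field names from this thread.
docs = [
    {"type": "booking", "date": "2016-12-30", "eventId": 246, "ca": 27},
    {"type": "booking", "date": "2017-01-03", "eventId": 246, "ca": 10},
    {"type": "booking", "date": "2017-01-05", "eventId": 300, "ca": 5},
    {"type": "other", "date": "2017-01-05", "eventId": 300, "ca": 99},
]


def simulate(docs, query, post_filter):
    matched = [d for d in docs if query(d)]        # 1. query
    sums = defaultdict(float)
    for d in matched:                              # 2. aggregation sees all query matches
        sums[d["eventId"]] += d["ca"]
    hits = [d for d in matched if post_filter(d)]  # 3. post_filter narrows hits only
    return dict(sums), hits


def is_booking(d):
    return d["type"] == "booking"


def in_range(d):
    # ISO dates compare correctly as strings.
    return "2017-01-02" <= d["date"] <= "2018-06-21"


# Date range as post_filter: the sums ignore the date range.
sums_post, hits_post = simulate(docs, is_booking, in_range)

# Date range inside the query: the sums respect the date range.
sums_query, hits_query = simulate(docs, lambda d: is_booking(d) and in_range(d), lambda d: True)

print(sums_post, sums_query)
```

So if the sums should only cover the date range, the range clause has to stay in the query.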

sylvainb · June 13, 2018, 7:54am

Maybe I'm missing something but I can't see any significant difference between this query and the very first one I used: only 10 results are returned in the terms aggregation... and they may not be the top 10.

klof · June 13, 2018, 10:25am

Yes, unless you define the size parameter in the aggregation, you will get 10 buckets (the ES default)...
What are the top 10 events? The ones (eventId) that have the most reservations (documents)?
Because the terms aggregation on eventId should give you the 10 events with the most documents.

So can we build this query step by step, starting by removing the post_filter clause?
But you should already get those 10 eventId values in the "aggregations" object of your response.
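
A tiny Python illustration, on made-up (eventId, ca) pairs, of why this matters: by default a terms aggregation orders buckets by document count, and that ranking can differ from a ranking by the sum of ca.

```python
from collections import Counter, defaultdict

# Made-up reservations: (eventId, ca)
reservations = [(1, 5), (1, 5), (1, 5), (2, 40), (2, 30), (3, 100)]

doc_counts = Counter(event_id for event_id, _ in reservations)
ca_sums = defaultdict(int)
for event_id, ca in reservations:
    ca_sums[event_id] += ca

# Ranking by number of documents (the default terms ordering).
by_doc_count = [k for k, _ in sorted(doc_counts.items(), key=lambda kv: kv[1], reverse=True)]
# Ranking by summed ca.
by_ca_sum = [k for k, _ in sorted(ca_sums.items(), key=lambda kv: kv[1], reverse=True)]

print(by_doc_count, by_ca_sum)
```

Here the event with the most reservations has the lowest total ca, so truncating to the first few buckets by document count can miss the events with the highest sums.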

sylvainb · June 13, 2018, 10:51am

FYI, an event document looks like this:

{
  "created_at": "2017-01-04T17:06:06.480+00:00",
  "updated_at": "2017-01-04T17:06:06.480+00:00",
  "type": "booking",
  "subType": "Atelier",
  "date": "2017-01-03",
  "stat": 1,
  "userId": 2396,
  "gender": "female",
  "age": 49,
  "group": "standard",
  "reservationId": 4560,
  "ca": 27,
  "name": "OPENLAB",
  "eventId": 246,
  "eventDate": "2017-01-14",
  "ageRange": "",
  "eventTheme": ""
}

The ranking criterion is the sum of the ca field within a group. For example, I need the 10 age groups whose sum of ca is the highest among all age groups.

I hope that's clear enough...

klof · June 13, 2018, 11:37am

Yes, perfectly clear!
You're right! I tried it with the top_hits aggregation, and it should cover your need.
Could you try this one and check the result?
You should get a list of 10 eventId values, sorted by "sum_CA", which is a sum aggregation over each event's ca.
The "top_events" aggregation lists only one reservation per eventId (the one with the highest ca), but you can increase its size to get more reservations for that eventId.

POST /stats/event/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "type": "booking"
          }
        }
      ]
    }
  },
  "aggs": {
    "top_events_id": {
      "terms": {
        "field": "eventId",
        "size": 10,
        "order": {
          "sum_CA": "desc"
        }
      },
      "aggs": {
        "top_events": {
          "top_hits": {
            "size": 1,
            "sort": [
              {
                "ca": "desc"
              }
            ]
          }
        },
        "sum_CA": {
          "sum": {
            "field": "ca"
          }
        }
      }
    }
  }
}

sylvainb · June 13, 2018, 1:34pm

Wow! That seems to work perfectly in every case...
I don't know how to thank you, I would never have thought of this by myself.

Could I ask for a bit more detail on how the top_hits aggregation works in this case? I can't figure out how this query works...

klof · June 14, 2018, 6:39am

From what I understand, the first aggregation is your top-level grouping (here by eventId). The top_hits aggregation then controls which documents are returned for each group; in your example you could also use "top_hits": {} if the individual documents don't matter.
The interesting part is the nested sum aggregation of ca per eventId, which the top-level terms aggregation can reference in its order clause to sort the eventId buckets.
This is just my understanding of the aggregation. I don't know if it's clear, but you should play with it and try the different options available.
Thank you for your use case, it's very interesting!
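
Since the ordering comes from the sum sub-aggregation rather than from top_hits, the same pattern can be reused for other grouping fields. Below is a hedged Python sketch of a helper that builds such a request body; the function name, the top_groups aggregation name and the optional date_range parameter are made up for illustration, and top_hits is left out on the assumption that the individual reservations are not needed.

```python
import json


def top_groups_request(group_field, size=10, date_range=None):
    """Build a search body returning the `size` groups with the highest sum of ca."""
    must = [{"term": {"type": "booking"}}]
    if date_range is not None:
        gte, lte = date_range
        must.append({"range": {"date": {"gte": gte, "lte": lte}}})
    return {
        "size": 0,  # hits size: skip the search hits, only the aggregation is needed
        "query": {"bool": {"must": must}},
        "aggs": {
            "top_groups": {
                "terms": {
                    "field": group_field,
                    "size": size,
                    "order": {"sum_CA": "desc"},  # refers to the sub-aggregation below
                },
                "aggs": {"sum_CA": {"sum": {"field": "ca"}}},
            }
        },
    }


body = top_groups_request(
    "eventTheme",
    date_range=("2017-01-02T02:00:00+01:00", "2018-06-21T23:59:59+02:00"),
)
print(json.dumps(body, indent=2))
```

Calling top_groups_request("age") or top_groups_request("eventId") gives the equivalent request for those fields. Note that ordering a terms aggregation by a sub-aggregation can be approximate when the index has several shards, so results are worth checking against known totals.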

system · July 12, 2018, 6:39am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
