**_NOTE: at this point we're focusing more on the functional design aspect rather than performance. Once we get this nailed down, we'll see how far we can push and optimize._**
### Background
The new aggregations module is due for the elasticsearch 1.0 release, and aims to serve as the next generation replacement for the functionality we currently refer to as "faceting". Facets currently provide a great way to aggregate data within a document set context. This context is defined by the executed query in combination with the different levels of filters that are defined (filtered queries, top level filters, and facet level filters). Although powerful as is, the current facets implementation was not designed from the ground up to support complex aggregations, and is thus limited. The main problem with the current implementation stems from the fact that facets are hard coded to work on one level, and that the different types of facets (which account for the different types of aggregations we support) cannot be mixed and matched dynamically at query time. It is not possible to compose facets out of other facets, and the user is effectively bound to the top level aggregations we defined and nothing more than that.
The goal of the new aggregations module is to break the barriers the current facet implementation puts in place. The new name ("Aggregations") also indicates the intention here - a generic yet extremely powerful framework for defining aggregations - any type of aggregation. The idea is to have each aggregation defined as a "standalone" aggregation that can perform its task within any context (as a top level aggregation or embedded within other aggregations that can potentially narrow its computation scope). We would like to take all the knowledge and experience we've gained over the years working with facets and apply it when building the new framework.
Before we dive into the meaty part, it's important to first establish some key concepts and terminology.
### Key Concepts & Terminology
- **Aggregation** - An aggregation is the result of an aggregation :) - or, more usefully, the output produced by an aggregator (defined below). There are many types of aggregations; some look similar, others have their own unique structure (all depending on the nature of the aggregation). For example, a `terms` aggregation holds a list of objects (buckets), each holding information about a unique term, while an `avg` aggregation simply holds the average aggregated over all values of a specific field (or fields) within a well defined set of documents.
- **Aggregator** - An aggregator is the computation unit in elasticsearch which generates aggregations. It is effectively responsible for aggregating the data during the query phase, and at the end of this phase, for creating the output aggregation. Each aggregation type has a dedicated aggregator which knows how to compute and generate it.
There are two types of aggregators/aggregations:
- **Bucket** - A family of aggregators whose main responsibility is to take the current document set context and split it into buckets, where each bucket defines a well defined document set context of its own. Typically, all aggregators of this type will also return the document count in each bucket. This aggregator is composable, meaning one can define other aggregations under it. It will then perform these defined aggregations for each of the buckets it builds. It is therefore possible to create buckets within buckets within buckets... up to any level of hierarchy one desires. For example, one can define a filter bucket that holds all the "active" users (for example, if the documents represent website users/visitors), under which she'll define a range bucket that builds 3 buckets to represent different user age groups, and under each age group she'll define a terms bucket to narrow down the most common tags each age group is using on the website (see the sketch right after this list). As you can see, creating hierarchies of buckets can be extremely powerful and can immensely help when slicing & dicing your data.
- **Calc** - A family of aggregators whose sole responsibility is to perform computations and calculate numbers. It always operates on a well defined document set scope. This scope is either the top most one - the scope defined by the search query - or otherwise defined by a higher level bucket aggregator (as discussed above). The calc aggregators typically work on field values, utilizing the field data from which they extract these values, but one can also use scripts to compute custom values which will be aggregated in different ways (depending on the specific calc aggregator that is used). When combining (mixing & matching) the different types of aggregators, bucket aggregators can be placed anywhere in the aggregation definition "tree", while calc aggregators are always "leaves" of the tree as (unlike bucket aggregators) they cannot contain other aggregators.
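For example, here is a sketch of the hierarchy described above (the `active`, `age` and `tags` fields, the bucket names, and the range boundaries are all hypothetical; the `filter`, `range` and `terms` aggregations themselves are described later in this document):
``` json
"aggs" : {
    "active_users" : {
        "filter" : { "term" : { "active" : true } },
        "aggs" : {
            "age_groups" : {
                "range" : {
                    "field" : "age",
                    "ranges" : [
                        { "to" : 25 },
                        { "from" : 25, "to" : 50 },
                        { "from" : 50 }
                    ]
                },
                "aggs" : {
                    "top_tags" : {
                        "terms" : { "field" : "tags" }
                    }
                }
            }
        }
    }
}
```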
#### Structuring Aggregations
The following snippet captures the basic structure of aggregations:
``` json
"aggregations" : {
"<aggregation_name>" : {
"<aggregation_type>" : {
<aggregation_body>
},
["aggregations" : { [<sub_aggregation>]* } ]
}
[,"<aggregation_name_2>" : { ... } ]*
}
```
The `aggregations` object (can also be `aggs` for short) in the json holds the aggregations you'd like to be computed. Each aggregation is associated with a logical name that the user defines (e.g. if the aggregation computes the average price, it'll make sense to call it `avg_price`). These logical names also uniquely identify the aggregations you define (you'll use the same names/keys to identify the aggregations in the response). Each aggregation has a specific type (`<aggregation_type>` in the above snippet), which is typically the first key within the named aggregation body. Each type of aggregation defines its own body, depending on the nature of the aggregation (e.g. the `avg` aggregation will define the field on which the avg will be calculated). At the same level as the aggregation type definition, one can optionally define a set of additional aggregations, though this only makes sense if the aggregation you defined is a bucketing aggregation. In this scenario, the aggregations you define on the bucketing aggregation level will be computed for all the buckets built by that bucketing aggregation. For example, if you define a set of aggregations under a `range` aggregation, these aggregations will be computed for each of the range buckets that are defined.
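For example, here is a minimal concrete instance of this structure, with one top level calc aggregation and one top level bucket aggregation holding a sub-aggregation (the field names are illustrative):
``` json
"aggs" : {
    "avg_price" : { "avg" : { "field" : "price" } },
    "genders" : {
        "terms" : { "field" : "gender" },
        "aggs" : {
            "avg_height" : { "avg" : { "field" : "height" } }
        }
    }
}
```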
In this manner, you can mix & match bucketing and calculating aggregations any way you'd like, creating complex hierarchies by embedding aggregations (of type bucket or calc) within other bucket aggregations. To better grasp how they can all work together, please refer to the examples section below.
### Calc Aggregators
This section provides an overview of all the calc aggregators available to date.
All the calc aggregators we have today belong to the same family, which we like to call `stats`. All the aggregators in this family are based on values that can either come from the field data or from a script that the user defines.
These aggregators operate on the following context: { _D_, _FV_ }, where _D_ is the set of documents from which the field values are extracted, and _FV_ is the set of values that should be aggregated. The aggregators take all those field values and calculate statistical values over them (some only calculate one value - they're called `single value stats aggregators`, while others generate a set of values - these are called `multi-value stats aggregators`).
Here are all the currently available stats aggregators:
#### Avg
Single Value Aggregator - Will return the average over all field values in the aggregation context, or whatever values the script generates.
``` json
"aggs" : {
"avg_price" : { "avg" : { "field" : "price" } }
}
```
``` json
"aggs" : {
"avg_price" : { "avg" : { "script" : "doc['price']" } }
}
```
``` json
"aggs" : {
"avg_price" : { "avg" : { "field" : "price", "script" : "_value" } }
}
```
_NOTE: when `field` and `script` are both specified, the script will be called for every value of the field in the context, and within the script you can access this value using the reserved variable `_value`._
Output:
``` json
"avg_price" : {
"value" : 10
}
```
#### Min
Single Value Aggregator - Will return the minimum value among all field values in the aggregation context, or whatever values the script generates.
``` json
"aggs" : {
"min_price" : { "min" : { "field" : "price" } }
}
```
``` json
"aggs" : {
"min_price" : { "min" : { "script" : "doc['price']" } }
}
```
``` json
"aggs" : {
"min_price" : { "min" : { "field" : "price", "script" : "_value" } }
}
```
Output:
``` json
"min_price" : {
"value" : 1
}
```
#### Max
Single Value Aggregator - Will return the maximum value among all field values in the aggregation context, or whatever values the script generates.
``` json
"aggs" : {
"max_price" : { "max" : { "field" : "price" } }
}
```
``` json
"aggs" : {
"max_price" : { "max" : { "script" : "doc['price']" } }
}
```
``` json
"aggs" : {
"max_price" : { "max" : { "field" : "price", "script" : "_value" } }
}
```
Output:
``` json
"max_price" : {
"value" : 100
}
```
#### Sum
Single Value Aggregator - Will return the sum of all field values in the aggregation context, or of whatever values the script generates.
``` json
"aggs" : {
"sum_price" : { "sum" : { "field" : "price" } }
}
```
``` json
"aggs" : {
"sum_price" : { "sum" : { "script" : "doc['price']" } }
}
```
``` json
"aggs" : {
"sum_price" : { "sum" : { "field" : "price", "script" : "_value" } }
}
```
Output:
``` json
"sum_price" : {
"value" : 350
}
```
#### Count
Single Value Aggregator - Will return the number of field values in the aggregation context, or of whatever values the script generates.
``` json
"aggs" : {
"prices_count" : { "count" : { "field" : "price" } }
}
```
``` json
"aggs" : {
"prices_count" : { "count" : { "script" : "doc['price']" } }
}
```
``` json
"aggs" : {
"prices_count" : { "count" : { "field" : "price", "script" : "_value" } }
}
```
Output:
``` json
"prices_count" : {
"value" : 400
}
```
#### Stats
Multi Value Aggregator - Will return the following stats, aggregated over the field values in the aggregation context or over whatever values the script generates:
- avg
- min
- max
- count
- sum
``` json
"aggs" : {
"price_stats" : { "stats" : { "field" : "price" } }
}
```
``` json
"aggs" : {
"prices_stats" : { "stats" : { "script" : "doc['price']" } }
}
```
``` json
"aggs" : {
"prices_stats" : { "stats" : { "field" : "price", "script" : "_value" } }
}
```
Output:
``` json
"prices_stats" : {
"min" : 1,
"max" : 10,
"avg" : 5.5,
"sum" : 55,
"count" : 10,
}
```
#### Extended Stats
Multi Value Aggregator - an extended version of the Stats aggregation above, where in addition to its aggregated statistics the following will also be aggregated:
- sum_of_squares
- variance
- std_deviation
``` json
"aggs" : {
"price_stats" : { "extended_stats" : { "field" : "price" } }
}
```
``` json
"aggs" : {
"prices_stats" : { "extended_stats" : { "script" : "doc['price']" } }
}
```
``` json
"aggs" : {
"prices_stats" : { "extended_stats" : { "field" : "price", "script" : "_value" } }
}
```
Output:
``` json
"value_stats": {
"count": 10,
"min": 1.0,
"max": 10.0,
"avg": 5.5,
"sum": 55.0,
"sum_of_squares": 385.0,
"variance": 8.25,
"std_deviation": 2.8722813232690143
}
```
### Bucket Aggregators
Bucket aggregators don't calculate values over fields like the calc aggregators do; instead, they create buckets of documents. Each bucket defines a criterion (depending on the aggregation type) that determines whether or not a document in the current context "falls" into it. In other words, the buckets effectively define document sets (a.k.a. docsets) on which the sub-aggregations run.
There are different bucket aggregators, each with a different "bucketing" strategy. Some define a single bucket, some define a fixed number of buckets, and others dynamically create the buckets while evaluating the docs.
The following sections describe the currently supported bucket aggregators.
#### Global
Defines a single bucket of all the documents within the search execution context. This context is defined by the indices and the document types you're searching on, but is **not** influenced by the search query itself.
_Note, global aggregators can only be placed as top level aggregators (it makes no sense to embed a global aggregator within another bucket aggregator)_
``` json
"aggs" : {
"global_stats" : {
"global" : {}, // global has an empty body
"aggs" : {
"avg_price" : { "avg" : { "field" : "price" } }
}
}
}
```
Output
``` json
"aggs" : {
"global_stats" : {
"doc_count" : 100,
"avg_price" : { "value" : 56.3 }
}
}
```
#### Filter
Defines a single bucket of all the documents in the current docset context which match a specified filter. Often this will be used to narrow down the current aggregation context to a specific set of documents.
``` json
"aggs" : {
"active_items" : {
"filter" : { "term" : { "active" : true } },
"aggs" : {
"avg_price" : { "avg" : { "field" : "price" } }
}
}
}
```
Output
``` json
"aggs" : {
"active_items" : {
"doc_count" : 100,
"avg_price" : { "value" : 56.3 }
}
}
```
#### Missing
A field data based single bucket aggregator that creates a bucket of all documents in the current docset context that are missing a value for a specific field. This aggregator will often be used in conjunction with other field data bucket aggregators (such as ranges) to return information for all the documents that could not be placed in any of the other buckets due to missing field data values. (The examples below show how well the range and the missing aggregators play together.)
``` json
"aggs" : {
"missing_price" : {
"missing" : { "field" : "price" }
}
}
```
Output
``` json
"aggs" : {
"missing_price" : {
"doc_count" : 10
}
}
```
#### Terms
A field data based multi-bucket aggregator where buckets are dynamically built - one per unique value (term) of a specific field. For each such bucket, the document count will be aggregated (accounting for all the documents in the current docset context that have that term for the specified field). This aggregator is very similar to how the terms facet works, except that it is an aggregator just like any other aggregator, meaning it can be embedded in other bucket aggregators and it can also hold any type of sub-aggregator itself.
``` json
"aggs" : {
"genders" : {
"terms" : { "field" : "gender" },
"aggs" : {
"avg_height" : { "avg" : { "field" : "height" } }
}
}
}
```
Output
``` json
"aggs" : {
"genders" : {
"terms" : [
{
"term" : "male",
"doc_count" : 10,
"avg_height" : 178.5
},
{
"term" : "female",
"doc_count" : 10,
"avg_height" : 165
},
]
}
}
```
**TODO: do we want to get rid of the "terms" level in the response and directly put the terms array under the aggregation name? (we do that in range aggregation)**
##### Options
| Name | Default | Required | Description |
| :-- | :-: | :-: | :-- |
| field | - | yes/no | the name of the field from which the terms will be taken. It is required if there is no other field data based aggregator in the current aggregation context and the **script** option is also not set |
| size | 10 | no | Only the top _n_ terms will be returned; the size determines what this _n_ is |
| order | count desc | no | the order in which the term buckets will be sorted, see below for possible values |
| script | - | no | one can choose to let a script generate the terms instead of extracting them verbatim from the field data. If the script is defined along with the field, the script will be executed for every term/value of the field data, with the special variable **_value** providing access to that value from within the script (as opposed to specifying only the script, without the field, in which case the script will execute once per document in the aggregation context). See the sketch below this table |
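For example, here is a sketch of the field + script combination described in the table above (the `tags` field is hypothetical, and `toLowerCase()` assumes an MVEL-like scripting language):
``` json
"aggs" : {
    "tags" : {
        "terms" : { "field" : "tags", "script" : "_value.toLowerCase()" }
    }
}
```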
##### About order
One can define the order in which the term buckets will be sorted and thereby returned in the response. There are 4 fixed/pre-defined order types and one more dynamic one:
Order by term (alphabetically) ascending/descending:
``` json
"aggs" : {
"genders" : {
"terms" : { "field" : "gender", "order": { "_term" : "desc" } }
}
}
```
Order by count ascending/descending:
``` json
"aggs" : {
"genders" : {
"terms" : { "field" : "gender", "order": { "_count" : "asc" } }
}
}
```
Order by a direct embedded calc aggregation, ascending/descending. For a single value calc aggregation:
``` json
"aggs" : {
"genders" : {
"terms" : { "field" : "gender", "order": { "avg_price" : "asc" } },
"aggs" : {
"avg_price" : { "avg" : { "field" : "price" } }
}
}
}
```
Or, for multi-value calc aggregation:
``` json
"aggs" : {
"genders" : {
"terms" : { "field" : "gender", "order": { "price_stats.avg" : "desc" } },
"aggs" : {
"price_stats" : { "stats" : { "field" : "price" } }
}
}
}
```
#### Range
A field data bucket aggregation that enables the user to define a field on which the bucketing will work, along with a set of ranges. The aggregator will check each field data value in the current docset context against each bucket range and "bucket" the relevant documents & values if they match. Note that here, not only are we bucketing by document, we're also bucketing by value. For example, let's say we're bucketing on a multi-value field, and document _D_ has values [1, 2, 3, 4, 5] for the field. In addition, there is a range bucket [ x < 4 ]. When evaluating document _D_, it falls right into this range bucket, but it does so due to field values [1, 2, 3], not due to values [4, 5]. Now... if this bucket also has sub-aggregators associated with it (say, a sum aggregator), the system will make sure to only aggregate values [1, 2, 3], excluding [4, 5] (as the values 4 and 5 don't really belong to this bucket). This is quite different from the other bucket aggregators we've seen until now, which mainly focused on whether the document falls in the bucket or not. Here we also keep track of the values belonging to each bucket.
``` json
"aggs" : {
"age_groups" : {
"range" : {
"field" : "age",
"ranges" : [
{ "to" : 5 },
{ "from" : 5, "to" : 10 },
{ "from" : 10, "to" : 15 },
{ "from" : 15}
]
},
"aggs" : {
"avg_height" : { "avg" : { "field" : "height" } }
}
}
}
```
Output
``` json
"aggregations" : {
"age_groups" : [
{
"to" : 5.0,
"doc_count" : 10,
"avg_height" : 95
},
{
"from" : 5.0,
"to" : 10.0,
"doc_count" : 5,
"avg_height" : 130
},
{
"from" : 10.0
"to" : 15.0,
"doc_count" : 4,
"avg_height" : 160
},
{
"from" : 15.0,
"doc_count" : 10,
"avg_height" : 175.5
}
]
}
```
Of course, you normally wouldn't store the **age** as a field, but rather the birthdate. We can use a script to compute the age:
``` json
"aggs" : {
"age_groups" : {
"range" : {
"script" : "DateTime.now().year - doc['birthdate'].date.year",
"ranges" : [
{ "to" : 5 },
{ "from" : 5, "to" : 10 },
{ "from" : 10, "to" : 15 },
{ "from" : 15}
]
},
"aggs" : {
"avg_height" : { "avg" : { "field" : "height" } }
}
}
}
```
As with all other aggregations, leaving out the **field** from a calc aggregator will fall back on the field by which the range bucketing is done.
``` json
"aggs" : {
"age_groups" : {
"range" : {
"field" : "age",
"ranges" : [
{ "to" : 5 },
{ "from" : 5, "to" : 10 },
{ "from" : 10, "to" : 15 },
{ "from" : 15}
]
},
"aggs" : {
"min" : { "min" : { } },
"max" : { "max" : { } }
}
}
}
```
Output
``` json
"aggregations" : {
"age_groups" : [
{
"to" : 5.0,
"doc_count" : 10,
"min" : 4.0,
"max" : 5.0
},
{
"from" : 5.0,
"to" : 10.0,
"doc_count" : 5,
"min" : 5.0,
"max" : 8.0
},
{
"from" : 10.0
"to" : 15.0,
"doc_count" : 4,
"min" : 11.0,
"max" : 13.0
},
{
"from" : 15.0,
"doc_count" : 10,
"min" : 15.0,
"max" : 22.0
}
]
}
```
Furthermore, you can also define a value script which will serve as a transformation of the field data value:
``` json
"aggs" : {
"age_groups" : {
"range" : {
"field" : "count",
"script" : "_value - 3"
"ranges" : [
{ "to" : 6 },
{ "from" : 6 }
]
},
"aggs" : {
"min" : { "min" : {} },
"min_count" : { "min" : { "field" : "count" } }
}
}
}
```
Output
``` json
"aggregations": {
"count_ranges": [
{
"to": 6.0,
"doc_count": 8,
"min": {
"value": -2.0
},
"min_count": {
"value": 1.0
}
},
{
"from": 6.0,
"doc_count": 2,
"min": {
"value": 6.0
},
"min_count": {
"value": 9.0
}
}
]
}
```
Notice, the **min** aggregation above acts on the actual values that were used for the bucketing (after the transformation by the script), while the **min_count** aggregation acts on the values of the count field that fall within each bucket.
#### Date Range
A range aggregation dedicated to date values. The main difference between this date range aggregation and the normal range aggregation is that the `from` and `to` values can be expressed in _Date Math_ expressions, and it is also possible to specify a date format by which the `from` and `to` json fields will be returned in the response:
``` json
"aggs": {
"range": {
"date_range": {
"field": "date",
"format": "MM-yyy",
"ranges": [
{
"to": "now-10M/M"
},
{
"from": "now-10M/M"
}
]
}
}
}
```
In the example above, we created two range buckets:
- the first will bucket all documents dated prior to 10 months ago
- the second will bucket all documents dated since 10 months ago
Output:
``` json
"aggregations": {
"range": [
{
"to": 1.3437792E+12,
"to_as_string": "08-2012",
"doc_count": 7
},
{
"from": 1.3437792E+12,
"from_as_string": "08-2012",
"doc_count": 2
}
]
}
```
#### IP Range
Just like the dedicated date range aggregation, there is also a dedicated range aggregation for IPv4 typed fields:
``` json
"aggs" : {
"ip_ranges" : {
"ip_range" : {
"field" : "ip",
"ranges" : [
{ "to" : "10.0.0.5" },
{ "from" : "10.0.0.5" }
]
}
}
}
```
Output:
``` json
"aggregations": {
"ip_ranges": [
{
"to": 167772165,
"to_as_string": "10.0.0.5",
"doc_count": 4
},
{
"from": 167772165,
"from_as_string": "10.0.0.5",
"doc_count": 6
}
]
}
```
IP ranges can also be defined as CIDR masks:
``` json
"aggs" : {
"ip_ranges" : {
"ip_range" : {
"field" : "ip",
"ranges" : [
{ "mask" : "10.0.0.0/25" },
{ "mask" : "10.0.0.127/25" }
]
}
}
}
```
Output:
``` json
"aggregations": {
"ip_ranges": [
{
"key": "10.0.0.0/25",
"from": 1.6777216E+8,
"from_as_string": "10.0.0.0",
"to": 167772287,
"to_as_string": "10.0.0.127",
"doc_count": 127
},
{
"key": "10.0.0.127/25",
"from": 1.6777216E+8,
"from_as_string": "10.0.0.0",
"to": 167772287,
"to_as_string": "10.0.0.127",
"doc_count": 127
}
]
}
```
#### Histogram
An aggregation that can be applied to numeric fields, and dynamically builds fixed size (a.k.a. interval) buckets over all the values of the document fields in the docset context. For example, if the documents have a field that holds a price (numeric), we can ask this aggregator to dynamically build buckets with interval 5 (in the case of `price` it may represent $5). When the aggregation executes, the price field of every document within the aggregation context will be evaluated and **rounded** down to its closest bucket - for example, if the price is `32` and the bucket size is `5`, then the rounding will yield `30` and thus the document will "fall" into the bucket that is associated with the key `30`. To make this more formal, here is the rounding function that is used:
`bucket_key = value - value % interval`
A basic histogram aggregation on a single numeric field `value` (may be a single or multi valued field):
``` json
"aggs" : {
"value_histo" : {
"histogram" : {
"field" : "value",
"interval" : 3
}
}
}
```
A histogram aggregation on multiple fields:
``` json
"aggs" : {
"value_histo" : {
"histogram" : {
"field" : [ "value", "values" ],
"interval" : 3
}
}
}
```
The output of the histogram is an array of buckets, where each bucket holds its key and the number of documents that fall in it. This array can be sorted on different attributes, in ascending or descending order:
- `_key` - The buckets will be sorted by their key
- `_count` - The buckets will be sorted by the number of documents that fall in them
- `aggName` - Buckets may hold other aggregations that are applied to the documents that fall in them. It is possible to sort the buckets based on direct single-valued **calc** aggregations that they hold
- `aggName.valueName` - It is also possible to sort buckets based on direct multi-valued **calc** aggregations that they hold
Sorting by bucket `key` descending
``` json
"aggs" : {
"histo" : {
"histogram" : {
"field" : "value",
"interval" : 3,
"order" : { "_key" : "desc" }
}
}
}
```
Sorting by document count ascending
``` json
"aggs" : {
"histo" : {
"histogram" : {
"field" : "value",
"interval" : 3,
"order" : { "_count" : "asc" }
}
}
}
```
Adding a sum aggregation (which is a single valued calc aggregation) to the buckets and sorting by it
``` json
"aggs" : {
"histo" : {
"histogram" : {
"field" : "value",
"interval" : 3,
"order" : { "value_sum" : "asc" }
},
"aggs" : {
"value_sum" : { "sum" : {} }
}
}
}
```
Adding a stats aggregation (which is a multi-valued calc aggregation) to the buckets and sorting by the avg
``` json
"aggs" : {
"histo" : {
"histogram" : {
"field" : "value",
"interval" : 3,
"order" : { "value_stats.avg" : "desc" }
},
"aggs" : {
"value_stats" : { "stats" : {} }
}
}
}
```
Using value scripts to "preprocess" the values before the bucketing
``` json
"aggs" : {
"histo" : {
"histogram" : {
"field" : "value",
"script" : "_value * 4",
"interval" : 3,
"order" : { "sum" : "desc"}
},
"aggregations" : {
"sum" : { "sum" : {} }
}
}
}
```
It's also possible to use document level scripts to compute the value by which the documents will be "bucketted"
``` json
"aggs" : {
"histo" : {
"histogram" : {
"script" : "doc['value'].value + doc['value2'].value",
"interval" : 3,
"order" : { "stats.sum" : "desc" }
},
"aggregations" : {
"stats" : { "stats" : {} }
}
}
}
```
Output:
``` json
"aggregations": {
"histo": [
{
"key": 21,
"doc_count": 2,
"stats": {
"count": 2,
"min": 8.0,
"max": 9.0,
"avg": 8.5,
"sum": 17.0
}
},
{
"key": 15,
"doc_count": 2,
"stats": {
"count": 2,
"min": 5.0,
"max": 6.0,
"avg": 5.5,
"sum": 11.0
}
},
{
"key": 24,
"doc_count": 1,
"stats": {
"count": 1,
"min": 10.0,
"max": 10.0,
"avg": 10.0,
"sum": 10.0
}
},
{
"key": 18,
"doc_count": 1,
"stats": {
"count": 1,
"min": 7.0,
"max": 7.0,
"avg": 7.0,
"sum": 7.0
}
},
{
"key": 9,
"doc_count": 2,
"stats": {
"count": 2,
"min": 2.0,
"max": 3.0,
"avg": 2.5,
"sum": 5.0
}
},
{
"key": 12,
"doc_count": 1,
"stats": {
"count": 1,
"min": 4.0,
"max": 4.0,
"avg": 4.0,
"sum": 4.0
}
},
{
"key": 6,
"doc_count": 1,
"stats": {
"count": 1,
"min": 1.0,
"max": 1.0,
"avg": 1.0,
"sum": 1.0
}
}
]
}
```
#### Date Histogram
Date histogram is a similar aggregation to the normal histogram (described above), except that it can only work on date fields. Since dates are indexed internally as long values, it's possible to use the normal histogram on dates as well, but the problem stems from the fact that time based intervals are not fixed (think of leap years and the number of days in a month). For this reason, we need special support for time based data. Functionality wise, this histogram supports the same features as the normal histogram. The main difference is that the interval can be specified by time expressions.
Building month long bucket intervals
``` json
"aggs" : {
"histo" : {
"date_histogram" : {
"field" : "date",
"interval" : "month"
}
}
}
```
or based on 1.5 months
``` json
"aggs" : {
"histo" : {
"date_histogram" : {
"field" : "date",
"interval" : "1.5M"
}
}
}
```
Other available expressions for interval: `year`, `quarter`, `week`, `day`, `hour`, `minute`, `second`
Since dates are represented internally as 64bit numbers, these numbers are returned as the bucket keys (each key representing a date). For this reason, it is also possible to define a date format, which will result in returning the dates as formatted strings next to the numeric key values:
``` json
"aggs" : {
"histo" : {
"date_histogram" : {
"field" : "date",
"interval" : "1M",
"format" : "yyyy-MM-dd"
}
}
}
```
Output:
``` json
"aggregations": {
"histo": [
{
"key_as_string": "2012-02-02",
"key": 1328140800000,
"doc_count": 1
},
{
"key_as_string": "2012-03-02",
"key": 1330646400000,
"doc_count": 2
},
...
]
}
```
Timezones are also supported, enabling the user to define the timezone by which they'd like to bucket the documents (this support is very similar to the TZ support in the DateHistogram facet).
Similar to the current date histogram facet, pre_offset & post_offset are also supported, applying offsets before and after the rounding. The values are time values with an optional `-` sign. For example, to offset a week rounding to start on Sunday instead of Monday, one can pass a pre_offset of -1d to decrease a day before doing the (monday based) week rounding, and then have post_offset set to -1d to actually set the returned value to be Sunday, and not Monday.
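A sketch of how these options could look on a `date_histogram` aggregation (the `time_zone`, `pre_offset` and `post_offset` parameter names are assumptions carried over from the facet, and the values are illustrative):
``` json
"aggs" : {
    "weeks" : {
        "date_histogram" : {
            "field" : "date",
            "interval" : "week",
            "time_zone" : "+02:00",
            "pre_offset" : "-1d",
            "post_offset" : "-1d"
        }
    }
}
```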
Like with the normal histogram, both document level scripts and value scripts are supported. It is also possible to control the order of the buckets that are returned. And of course, other aggregations can be nested within the buckets.
Both the normal `histogram` and the `date_histogram` now support computing/returning empty buckets. This can be controlled by setting the `compute_empty_buckets` parameter to `true` (defaults to `false`).
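For example, a sketch of a histogram that also computes its empty buckets (the exact placement of the parameter within the body is an assumption):
``` json
"aggs" : {
    "value_histo" : {
        "histogram" : {
            "field" : "value",
            "interval" : 3,
            "compute_empty_buckets" : true
        }
    }
}
```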
#### Geo Distance
An aggregation that works on `geo_point` fields. Conceptually, it works very similarly to the range aggregation. The user can define a point of `origin` and a set of distance range buckets. The aggregation evaluates the distance of each document from the `origin` point and determines the bucket it belongs to based on the ranges (a document belongs to a bucket if the distance between the document and the `origin` falls within the distance range of the bucket).
``` json
"aggs" : {
"rings" : {
"geo_distance" : {
"field" : "location",
"origin" : "52.3760, 4.894",
"ranges" : [
{ "to" : 100 },
{ "from" : 100, "to" : 300 },
{ "from" : 300 }
]
}
}
}
```
Output
``` json
"aggregations": {
"rings": [
{
"unit": "km",
"to": 100.0,
"doc_count": 3
},
{
"unit": "km",
"from": 100.0,
"to": 300.0,
"doc_count": 1
},
{
"unit": "km",
"from": 300.0,
"doc_count": 7
}
]
}
```
The specified `field` must be of type `geo_point` (which can only be set explicitly in the mappings). The field can also hold an array of `geo_point` values, in which case all will be taken into account during aggregation. The `origin` point can accept all formats `geo_point` supports:
- Object format: `{ "lat" : 52.3760, "lon" : 4.894 }` - this is the safest format as it's the most explicit about the `lat` & `lon` values
- String format: `"52.3760, 4.894"` - where the first number is the `lat` and the second is the `lon`
- Array format: `[4.894, 52.3760]` - which is based on the GeoJson standard and where the first number is the `lon` and the second one is the `lat`
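For example, the same aggregation as above with the `origin` specified in object format:
``` json
"aggs" : {
    "rings" : {
        "geo_distance" : {
            "field" : "location",
            "origin" : { "lat" : 52.3760, "lon" : 4.894 },
            "ranges" : [
                { "to" : 100 },
                { "from" : 100 }
            ]
        }
    }
}
```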
By default, the distance unit is `km` but it can also accept: `mi` (miles), `in` (inch), `yd` (yards), `m` (meters), `cm` (centimeters), `mm` (millimeters).
``` json
"aggs" : {
"rings" : {
"geo_distance" : {
"field" : "location",
"origin" : "52.3760, 4.894",
"unit" : "mi",
"ranges" : [
{ "to" : 100 },
{ "from" : 100, "to" : 300 },
{ "from" : 300 }
]
}
}
}
```
There are two distance calculation modes: `arc` (the default) and `plane`. The `arc` calculation is the most accurate one, but also the more expensive one in terms of performance. The `plane` calculation is faster but less accurate. Consider using `plane` when your search context is narrowed down to smaller areas (like cities or even countries); `plane` may return higher error margins for searches across very large areas (e.g. cross-Atlantic search).
``` json
"aggs" : {
"rings" : {
"geo_distance" : {
"field" : "location",
"origin" : "52.3760, 4.894",
"distance_type" : "plane",
"ranges" : [
{ "to" : 100 },
{ "from" : 100, "to" : 300 },
{ "from" : 300 }
]
}
}
}
```
#### Nested
A special single bucket aggregation which enables aggregating nested documents. Assuming the following mapping:
``` json
"type" : {
"properties" : {
"nested" : { "type" : "nested" }
}
}
}
```
Here's how a nested aggregation can be defined:
``` json
"aggs" : {
"nested_value_stats" : {
"nested" : {
"path" : "nested"
},
"aggs" : {
"stats" : {
"stats" : { "field" : "nested.value" }
}
}
}
}
```
As you can see above, the nested aggregation requires the path of the nested documents within the top level documents. Then one can define any type of aggregation over these nested documents.
Output:
``` json
"aggregations": {
"employees_salaries": {
"doc_count": 25,
"stats": {
"count": 25,
"min": 1.0,
"max": 9.0,
"avg": 5.0,
"sum": 125.0
}
}
}
```
### Examples
#### Filter + Range + Missing + Stats
Analyse the online product catalog web access logs. The following aggregation will only aggregate the logs from yesterday (the **filter** aggregation), providing information for different price ranges (the **range** aggregation), where per price range we'll return the price stats on that range and the total page views for the documents in each range. We're also interested in finding all the bloopers - all those products that for some reason don't have prices associated with them, yet are still exposed to the user and being accessed and viewed.
``` json
"aggs" : {
"yesterday" : {
"filter" : { "range" : { "date" { "gt" : "now-1d/d", "lt" : "now/d" } } },
"aggs" : {
"missing_price" : {
"missing" : { "field" : "price" },
"aggs" : {
"total_page_views" : { "sum" : { "field" : "page_views" } }
}
},
"prices" : {
"range" : {
"field" : "price",
"ranges" : [
{ "to" : 100 },
{ "from" : 100, "to" : 200 },
{ "from" : 200, "to" 300 },
{ "from" : 300 }
]
},
"aggs" : {
"price_stats" : { "stats" : {} },
"total_page_views" : { "sum" : { "field" : "page_views" } }
}
}
}
}
}
```
#### Aggregating Hierarchical Data
Quite often you'd like to get aggregations on location in a hierarchical manner. For example, show all countries and how many documents fall within each country, and for each country show a breakdown by city. Here's a simple way to do it using hierarchical terms aggregations:
``` json
"aggs" : {
"country" : {
"terms" : { "field" : "country" },
"aggs" : {
"city" : {
"terms" : { "field" : "city" }
}
}
}
}
```