I have "city", "state" and "country" fields in the index and need to calculate the document count by each city. The index contains more than 2 million documents.
This was my first stab at it:
get students/_search
{
"size": 0,
"aggs": {
"country": {
"terms": {
"size": 1000,
"field": "country.keyword"
},
"aggs": {
"state": {
"terms": {
"size": 50,
"field": "state.keyword"
},
"aggs": {
"city": {
"terms": {
"size": 3,
"field": "city.keyword"
}
}
}
}
}
}
}
}
This works. But there was a Kibana warning:
#! Deprecation: This aggregation creates too many buckets (10001) and will throw an error in future versions. You should update the [search.max_buckets] cluster setting or use the [composite] aggregation to paginate all buckets in multiple requests.
As a side note, I initially ran the above query with a "size" of 1000 for both "state" and "city" aggregations. The development server ran out of heap space. So was forced to reduce the numbers seen in the query.
Going back to the warning, it sounds like I would switch to a composite query for paginating through the buckets. So I attempted to switch the query to a composite aggregation. But have had limited success.
Attempt 1
GET students/_search
{
"size": 0,
"aggs": {
"country_state_city": {
"composite": {
"sources": [
{
"product": {
"terms": {
"field": "state.keyword"
}
}
},
{
"product": {
"terms": {
"field": "city.keyword"
}
}
},
{
"product": {
"terms": {
"field": "country.keyword"
}
}
}
]
}
}
}
}
This gave the following result:
"aggregations": {
"country_state_city": {
"buckets": [
{
"key": {
"product": "Greece"
},
"doc_count": 1
},
{
"key": {
"product": "Italy"
},
"doc_count": 1
},
{
"key": {
"product": "France"
},
"doc_count": 1
},
{
"key": {
"product": "United States"
},
"doc_count": 1
},
I know for a fact that "United States" has more than one document. After reading the documentation, I am not entirely sure if the query was coined correctly.
Attempt 2
GET students/_search
{
"size": 0,
"track_total_hits" : false,
"aggs": {
"country_state_city": {
"composite": {
"size": 100,
"sources": [
{
"country_state_city": {
"terms": {
"script": {
"source": "def country= doc['country.keyword'].value; country = (null == country ? new String() : country); def state = doc['state.keyword'].value; state = (null == state ? new String() : ', ' + state); def city = doc['city.keyword'].value; city = (null == city ? new String() : ', ' + city); country + state + city; ",
"lang": "painless"
},
"order": "desc"
}
}
}
]
}
}
}
}
The latest ES version supports runtime fields. But 6.2 (that I am stuck with) does not appear to support this (at least from what I can tell).
Some documents have null values for country, state and city. Hence the null check.
The above query gave the following result:
"aggregations": {
"country_state_city": {
"buckets": [
{
"key": {
"country_state_city": "Åland Islands, Mariehamn"
},
"doc_count": 6
},
{
"key": {
"country_state_city": "Åland Islands, Geta"
},
"doc_count": 1
},
{
"key": {
"country_state_city": "Zimbabwe, Victoria Falls"
},
"doc_count": 31
},
This again is wrong (the actual numbers are higher).
I am wondering what I could be doing wrong?
My Elasticsearch version is 6.2.4. Upgrading the cluster is not an option (operations team is too busy).