Hi community!
My Elasticsearch version is 7.7.0. I'm running into a "too many buckets"
exception with my aggregation query.
Task description:
Let's say we have a car index; a car document looks like this:
{
  "vin": "08c00679-8a58-11ec-9745-1aed518f8637",
  "make": "Lexus",
  "model": "GS",
  "country": "US",
  "state": "CA",
  "city": "San-Diego",
  "color": "black",
  // some other fields
  "dates": [
    {
      "type": "manufacture",
      "date": "2010-01-01"
    },
    {
      "type": "lastRepair",
      "date": "2022-01-02"
    }
  ]
}
Users can search and group the car data by several fields (make, country, state, etc.).
As a result of searching and grouping, the user gets a table like the one below.
The rule for a table cell is: when the number of distinct values is less than 5, all of them are shown; otherwise Many(count) is displayed.
The issue:
For each column I run two aggregations: cardinality and terms. A rough version of the query is below:
GET cars.read/_search
{
  "size": 0,
  "query": {
    ...
  },
  "aggregations": {
    "root": {
      "composite": {
        "size": 50,
        "sources": [
          {
            "countries": {
              "terms": { "field": "country" }
            }
          },
          {
            "states": {
              "terms": { "field": "state" }
            }
          }
        ]
      },
      "aggregations": {
        "cities": {
          "terms": { "field": "city", "size": 5 }
        },
        "citiesCount": {
          "cardinality": { "field": "city" }
        },
        "makes": {
          "terms": { "field": "make", "size": 5 }
        },
        "makesCount": {
          "cardinality": { "field": "make" }
        },
        "models": {
          "terms": { "field": "model", "size": 5 }
        },
        "modelsCount": {
          "cardinality": { "field": "model" }
        },
        "colors": {
          "terms": { "field": "color", "size": 5 }
        },
        "colorsCount": {
          "cardinality": { "field": "color" }
        },
        // other fields' aggregations
        "datesNested": {
          "nested": { "path": "dates" },
          "aggregations": {
            "dates": {
              "terms": { "field": "dates.type", "size": 10 },
              "aggregations": {
                "value": {
                  "terms": { "field": "dates.date", "missing": 0, "size": 5 }
                },
                "valueCount": {
                  "cardinality": { "field": "dates.date", "missing": 0 }
                }
              }
            }
          }
        }
      }
    }
  }
}
Under some conditions I'm getting a "too many buckets" exception. As far as I understand, even though I set "size": 5 on the terms aggregations, Elasticsearch still creates buckets for all the unique values within a shard (size only limits what is returned), so hitting the 10k bucket limit is no surprise.
In fact, I don't need the terms aggregation at all when a field's cardinality is greater than 5, so in most cases I'm just wasting Elasticsearch resources.
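For what it's worth, I know the limit itself is the search.max_buckets dynamic cluster setting (10000 by default in 7.x) and can be raised, e.g.:

```json
PUT _cluster/settings
{
  "transient": {
    "search.max_buckets": 20000
  }
}
```

But that just postpones the problem, so I'd rather restructure the query.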
So I'm wondering:
- Is there a way to make the terms aggregation conditional, e.g. run it only when the cardinality is low?
- I tried the sampler aggregation, but it didn't seem to help. I added it as a sub-aggregation of root; I'm not sure that's the right place for a sampler.
- Does it make sense to play around with the terms aggregation's shard_size? I don't need the top 5 values, any 5 values are fine.
- Will significant_terms (or any other) aggregation help me here?
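On the first point, the only workaround I've come up with so far is splitting the work into two round trips (a rough sketch reusing the fields from above; it glosses over the fact that cardinality really needs to be computed per country/state group, as in the full query): first ask only for the cardinalities, then send a second request with terms aggregations only for the columns whose count came back below 5:

```json
// Request 1: cardinalities only (few buckets, cheap)
GET cars.read/_search
{
  "size": 0,
  "aggregations": {
    "citiesCount": { "cardinality": { "field": "city" } },
    "makesCount": { "cardinality": { "field": "make" } }
  }
}

// Request 2: terms only for the fields whose cardinality was < 5
GET cars.read/_search
{
  "size": 0,
  "aggregations": {
    "makes": { "terms": { "field": "make", "size": 5 } }
  }
}
```

It doubles the latency, though, so I'd prefer a single-request solution if one exists.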
I'd be grateful for any reply.
Best regards,
Alex