How to avoid or ease terms aggregation for fields with high cardinality?

Hi community!

My Elasticsearch version is 7.7.0. I'm facing a "too many buckets" issue with my aggregation query.
Task description:
Let's say we have a car index, and the car document model looks like this:

{
    "vin": "08c00679-8a58-11ec-9745-1aed518f8637",
    "make": "Lexus",
    "model": "GS",
    "country": "US",
    "state": "CA",
    "city": "San-Diego",
    "color": "black",
    // some other fields
    "dates": [
      {
        "type": "manufacture",
        "date": "2010-01-01"
      },
      {
        "type": "lastRepair",
        "date": "2022-01-02"
      }
    ]
  }

Users can search and group car data by several fields (make, country, state, etc.).
As a result of searching and grouping, the user gets a table like the one below:


Basically, the rule for a table cell is: when the value count is less than 5, all the values are shown; otherwise Many(count) is displayed.

The issue:
For each column I run two aggregations: cardinality and terms. Roughly, the query I use is below:

GET cars.read/_search
{
  "size": 0,
  "query": {
    ...
  },
  "aggregations": {
    "root": {
      "composite": {
        "size": 50,
        "sources": [{
          "countries": {
            "terms": {
              "field": "country"
            }
          }
        }, {
          "states": {
            "terms": {
              "field": "state"
            }
          }
        }]
      },
      "aggregations": {
        "cities": {
          "terms": {
            "field": "city",
            "size": 5
          }
        },
        "citiesCount": {
          "cardinality": {
            "field": "city"
          }
        },
        "makes": {
          "terms": {
            "field": "make",
            "size": 5
          }
        },
        "makesCount": {
          "cardinality": {
            "field": "make"
          }
        },
        "models": {
          "terms": {
            "field": "model",
            "size": 5
          }
        },
        "modelsCount": {
          "cardinality": {
            "field": "model"
          }
        },
        "colors": {
          "terms": {
            "field": "color",
            "size": 5
          }
        },
        "colorsCount": {
          "cardinality": {
            "field": "color"
          }
        },
       // other fields aggregations
        "datesNested": {
          "nested": {
            "path": "dates"
          },
          "aggregations": {
            "dates": {
              "terms": {
                "field": "dates.type",
                "size": 10
              },
              "aggregations": {
                "value": {
                  "terms": {
                    "field": "dates.date",
                    "missing": 0,
                    "size": 5
                  }
                },
                "valueCount": {
                  "cardinality": {
                    "field": "dates.date",
                    "missing": 0
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

Under some conditions I'm getting a too_many_buckets exception. As far as I understand, even though I add "size": 5 to the terms aggregations, Elasticsearch still creates buckets for all the unique values within a shard, so hitting the 10k bucket limit is no surprise.
In fact, I don't really need the terms aggregation at all when the field cardinality is greater than 5, so in most cases I'm just wasting Elasticsearch resources.
So I'm wondering:

  1. Is there a way to have some kind of conditional terms aggregation, e.g. run the terms aggregation only when the cardinality is low?
  2. I tried the sampler aggregation, but it didn't seem to help. I added it as a sub-aggregation of root; I'm not sure that's a good place for a sampler.
  3. Does it make sense to play around with the terms aggregation's shard_size parameter? I don't need the 5 top-scored values; any 5 values are fine.
  4. Would significant_terms (or any other) aggregation help me here?
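To illustrate question 3, shard_size would sit right next to size on a terms aggregation; the value of 5 below is just an example, I haven't measured whether it actually reduces bucket pressure:

```json
"cities": {
  "terms": {
    "field": "city",
    "size": 5,
    "shard_size": 5
  }
}
```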

Would be grateful for any reply.

Best regards,
Alex

Sounds like the scripted metric aggregation would be the way to implement this special logic.
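A minimal, untested sketch of what that could look like for a single field: each shard collects at most 6 distinct values, and the reduce phase returns either the value list or a "Many" marker. The Painless logic here is an assumption, not a verified implementation, and the marker can't carry an exact count once collection is capped, so you'd still pair it with a cardinality aggregation for the real number:

```json
GET cars.read/_search
{
  "size": 0,
  "aggregations": {
    "cities_capped": {
      "scripted_metric": {
        "init_script": "state.values = new HashSet();",
        "map_script": "if (doc['city'].size() > 0 && state.values.size() < 6) { state.values.add(doc['city'].value); }",
        "combine_script": "return state.values;",
        "reduce_script": "def all = new HashSet(); for (s in states) { all.addAll(s); } return all.size() > 5 ? 'Many' : new ArrayList(all);"
      }
    }
  }
}
```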


I suppose not. I experimented with search.max_buckets and the size of the terms aggregation; a terms aggregation with size equal to or smaller than search.max_buckets worked fine.

It's the number of combinations that produces the extreme bucket count. There are 50 composite buckets and, for example, 10 × 5 buckets in the datesNested aggregation, which already comes to 50 * 10 * 5 = 2,500 buckets. With the other fields added, it could easily exceed the limit.

As the composite aggregation supports pagination, that could be a simple solution in this case.
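For reference, paginating the composite aggregation looks roughly like this: take the after_key returned in one response and pass it as the after parameter of the next request (the key values below are illustrative):

```json
GET cars.read/_search
{
  "size": 0,
  "aggregations": {
    "root": {
      "composite": {
        "size": 50,
        "after": { "countries": "US", "states": "CA" },
        "sources": [{
          "countries": {
            "terms": {
              "field": "country"
            }
          }
        }, {
          "states": {
            "terms": {
              "field": "state"
            }
          }
        }]
      }
    }
  }
}
```

Fewer composite buckets per page means fewer multiplied sub-aggregation buckets per request, which keeps each response under the limit.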

@Mark_Harwood thanks for the reply; I did some tests and it looks promising!


This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.