How to avoid or ease terms aggregation for fields with high cardinality?

Hi community!

My Elasticsearch version is 7.7.0. I'm facing a "too many buckets" issue with my aggregation query.
Task description:
Let's say we have a car index, and the car document model looks like this:

{
    "vin": "08c00679-8a58-11ec-9745-1aed518f8637",
    "make": "Lexus",
    "model": "GS",
    "country": "US",
    "state": "CA",
    "city": "San-Diego",
    "color": "black",
    // some other fields
    "dates": [
      {
        "type": "manufacture",
        "date": "2010-01-01"
      },
      {
        "type": "lastRepair",
        "date": "2022-01-02"
      }
    ]
  }

Users can search and group car data by several fields (make, country, state, etc.).
As a result of searching and grouping, the user gets a table like the one below:


Basically, the rule for a table cell is: when the value count is less than 5, all the values are shown; otherwise Many(count) is displayed.

The issue:
For each column I run two aggregations: cardinality and terms. Roughly, the query I use is below:

GET cars.read/_search
{
  "size": 0,
  "query": {
    ...
  },
  "aggregations": {
    "root": {
      "composite": {
        "size": 50,
        "sources": [{
          "countries": {
            "terms": {
              "field": "country"
            }
          }
        }, {
          "states": {
            "terms": {
              "field": "state"
            }
          }
        }]
      },
      "aggregations": {
        "cities": {
          "terms": {
            "field": "city",
            "size": 5
          }
        },
        "citiesCount": {
          "cardinality": {
            "field": "city"
          }
        },
        "makes": {
          "terms": {
            "field": "make",
            "size": 5
          }
        },
        "makesCount": {
          "cardinality": {
            "field": "make"
          }
        },
        "models": {
          "terms": {
            "field": "model",
            "size": 5
          }
        },
        "modelsCount": {
          "cardinality": {
            "field": "model"
          }
        },
        "colors": {
          "terms": {
            "field": "color",
            "size": 5
          }
        },
        "colorsCount": {
          "cardinality": {
            "field": "color"
          }
        },
       // other fields aggregations
        "datesNested": {
          "nested": {
            "path": "dates"
          },
          "aggregations": {
            "dates": {
              "terms": {
                "field": "dates.type",
                "size": 10
              },
              "aggregations": {
                "value": {
                  "terms": {
                    "field": "dates.date",
                    "missing": 0,
                    "size": 5
                  }
                },
                "valueCount": {
                  "cardinality": {
                    "field": "dates.date",
                    "missing": 0
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

Under some conditions I'm getting a too_many_buckets exception. As far as I understand, even though I add "size": 5 to the terms aggregations, Elasticsearch still creates buckets for all the unique values within a shard, so hitting the 10k bucket limit is no surprise.
In fact, I don't really need the terms aggregation at all when the field cardinality is greater than 5, so in most cases I'm just wasting Elasticsearch resources.
So I'm wondering:

  1. Is there a way to have some kind of conditional terms aggregation, e.g. run the terms aggregation only when the cardinality is low?
  2. I tried the sampler aggregation, but it didn't seem to help. I added it as a sub-aggregation of root; I'm not sure that's a good place for a sampler.
  3. Does it make sense to play around with the terms aggregation's shard_size parameter? I don't need the 5 top-scored values; any 5 values are fine.
  4. Would significant_terms (or any other) aggregation help me here?
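To illustrate question 3, shard_size would sit right next to size on a terms aggregation; the value of 5 below is just an example, I haven't measured whether it actually reduces bucket pressure:

```json
"cities": {
  "terms": {
    "field": "city",
    "size": 5,
    "shard_size": 5
  }
}
```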

Would be grateful for any reply.

Best regards,
Alex

Sounds like the scripted metric aggregation would be the way to implement this special logic.
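A minimal, untested sketch of what that could look like for a single field: each shard collects at most 6 distinct values, and the reduce phase returns either the value list or a "Many" marker. The Painless logic here is an assumption, not a verified implementation, and the marker can't carry an exact count once collection is capped, so you'd still pair it with a cardinality aggregation for the real number:

```json
GET cars.read/_search
{
  "size": 0,
  "aggregations": {
    "cities_capped": {
      "scripted_metric": {
        "init_script": "state.values = new HashSet();",
        "map_script": "if (doc['city'].size() > 0 && state.values.size() < 6) { state.values.add(doc['city'].value); }",
        "combine_script": "return state.values;",
        "reduce_script": "def all = new HashSet(); for (s in states) { all.addAll(s); } return all.size() > 5 ? 'Many' : new ArrayList(all);"
      }
    }
  }
}
```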


I suppose not. I experimented with search.max_buckets and the size of the terms aggregation; a terms aggregation with size equal to or smaller than search.max_buckets worked fine.

It's the number of combinations that produces the extreme bucket count. There are 50 composite buckets and, for example, 10 × 5 buckets in the datesNested aggregation, which already comes to 50 * 10 * 5 = 2,500 buckets. With the other fields added, it could easily exceed the limit.

As the composite aggregation supports pagination, that could be a simple solution in this case.
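For reference, paginating the composite aggregation looks roughly like this: take the after_key returned in one response and pass it as the after parameter of the next request (the key values below are illustrative):

```json
GET cars.read/_search
{
  "size": 0,
  "aggregations": {
    "root": {
      "composite": {
        "size": 50,
        "after": { "countries": "US", "states": "CA" },
        "sources": [{
          "countries": {
            "terms": {
              "field": "country"
            }
          }
        }, {
          "states": {
            "terms": {
              "field": "state"
            }
          }
        }]
      }
    }
  }
}
```

Fewer composite buckets per page means fewer multiplied sub-aggregation buckets per request, which keeps each response under the limit.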

@Mark_Harwood thanks for the reply; I did some tests and it looks promising!


This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.