Terms aggregation on high cardinality field

Hello!

I am sharing an issue we are having, hoping that someone can help.

We are running a terms aggregation on a field with very high cardinality: about 40 million unique values, each representing a unique product identifier (ASIN).

For each of those product identifiers we have weekly data that we need to aggregate by different criteria such as brand, category, time range, and so on. In short, we index about 40 million documents per week.

An example of the query looks like this:

{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "dateFrom": {
              "from": "2024-08-25",
              "to": "2024-09-21",
              "include_lower": true,
              "include_upper": false,
              "boost": 1
            }
          }
        },
        {
          "terms": {
            "brand.keyword": [
              "Nike"
            ],
            "boost": 1
          }
        }
      ],
      "adjust_pure_negative": true,
      "boost": 1
    }
  },
  "_source": false,
  "track_total_hits": -1,
  "aggregations": {
    "asins": {
      "terms": {
        "field": "asin",
        "size": 1000,
        "min_doc_count": 1,
        "shard_min_doc_count": 0,
        "show_term_doc_count_error": false,
        "order": [
          {
            "totalSales": "desc"
          },
          {
            "_key": "asc"
          }
        ]
      },
      "aggregations": {
        "totalSales": {
          "sum": {
            "field": "salesEstimate"
          }
        }
      }
    }
  }
}

As you may imagine, we are facing performance issues and response times are very slow, higher than 30 seconds in some cases when a brand contains a lot of products.

Mapping for the ASIN field:

"asin": {
  "type": "keyword",
  "eager_global_ordinals": true
}

Each weekly index is around 10 GB in size and has 1 primary shard and 1 replica shard.

The cluster has 3 master nodes and 4 data nodes, with 4 CPUs and 32 GB of RAM each.

Is there anything we can do here, either in cluster configuration or in the way the query is built, or is poor performance expected in this case, given the number of documents and unique ASINs we store?

Thank you in advance.

Hi Ivan,
I have the exact same issue.

The field type for our field is keyword, so doc_values is enabled, and eager_global_ordinals is set to true.

For queries returning more than 1 or 2 million hits, the aggregation is very slow to compute. Profiling reports that GlobalOrdinalsStringTermsAggregator spends more than 99% of its time in the collect phase.
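
In case it is useful, this is roughly how such a profile can be gathered; a minimal sketch, where the index name is a placeholder and the aggregation is trimmed down to just the terms aggregation being profiled:

# Sketch only: "my-index" is a placeholder; the timing breakdown, including
# GlobalOrdinalsStringTermsAggregator, is returned under "profile" in the response.
POST /my-index/_search
{
  "size": 0,
  "profile": true,
  "aggregations": {
    "asins": {
      "terms": { "field": "asin", "size": 1000 }
    }
  }
}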

Our Elasticsearch cluster runs in a Kubernetes environment with 8 data nodes, each with 8 vCPUs and 32 GB RAM (8 GB heap). We don't see disk I/O on the Elasticsearch nodes, as the off-heap memory is large and index caching is therefore efficient.

A possible solution is to wrap this aggregation in a sampler aggregation, but that is not really acceptable for us.
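
For reference, a rough sketch of what that would look like; the index name and shard_size are illustrative values only, and the inner aggregations mirror the ones from the original post:

# Sketch only: "my-index" and shard_size are illustrative, not recommendations.
POST /my-index/_search
{
  "size": 0,
  "aggregations": {
    "sample": {
      "sampler": { "shard_size": 200000 },
      "aggregations": {
        "asins": {
          "terms": {
            "field": "asin",
            "size": 1000,
            "order": [ { "totalSales": "desc" }, { "_key": "asc" } ]
          },
          "aggregations": {
            "totalSales": { "sum": { "field": "salesEstimate" } }
          }
        }
      }
    }
  }
}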

Did you find a solution to this issue?

Regards

Dominique

Note: 1 shard means only one core processes the request, so in general use more shards and check that all cores are utilized during the request.
Then, use the date to slice your indices (IIRC there are some built-in facilities to do so), i.e. the date filter can be implemented (partially) by picking the right index: /sales-2024-08/_search.
The cluster is a bit modest by today's standards; if you manage to fully utilize it, scale out.
Also, this is a data cube problem, so Spark or Pinot may suit it better.
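
For illustration, something along these lines; the template and index names, the index pattern, and the shard count are examples only and need to be adapted to your data volumes:

# Sketch only: names, pattern and shard count are examples, not tuned values.
PUT _index_template/sales-weekly
{
  "index_patterns": ["sales-*"],
  "template": {
    "settings": {
      "number_of_shards": 4,
      "number_of_replicas": 1
    }
  }
}

# Then query only the weekly indices covering the requested period,
# instead of filtering a broad index pattern by date:
POST /sales-2024-08-25,sales-2024-09-01,sales-2024-09-08,sales-2024-09-15/_search
{
  "size": 0,
  "aggregations": {
    "asins": { "terms": { "field": "asin", "size": 1000 } }
  }
}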

Hi,

More details about the index: it contains 13 million documents and is split into 32 shards (so 4 shards per Elasticsearch node). Each shard contains about 400,000 docs and has a size of 12 GB.
Dominique

My use case is that I have an index with 45 million documents, a size of 17 GB, and only 1 shard.
There is one such index for every week for the last 2 years, and we request aggregations for periods of between 4 and 52 weeks.

No, not yet; we are still seeing poor performance for time ranges longer than 4 weeks.

If it is acceptable, maybe you can use sampler aggregations for high-cardinality fields.