Terms aggregation on high cardinality field

Hello!

I am sharing an issue we are having, hoping that someone can help.

We are running a terms aggregation on a field with very high cardinality: about 40 million unique values, each representing a unique product identifier (ASIN).

For each of those product identifiers we have weekly data that we need to aggregate by different criteria such as brand, category, time range, and so on. In short, we index about 40 million documents per week.

An example of the query looks like this:

{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "dateFrom": {
              "from": "2024-08-25",
              "to": "2024-09-21",
              "include_lower": true,
              "include_upper": false,
              "boost": 1
            }
          }
        },
        {
          "terms": {
            "brand.keyword": [
              "Nike"
            ],
            "boost": 1
          }
        }
      ],
      "adjust_pure_negative": true,
      "boost": 1
    }
  },
  "_source": false,
  "track_total_hits": -1,
  "aggregations": {
    "asins": {
      "terms": {
        "field": "asin",
        "size": 1000,
        "min_doc_count": 1,
        "shard_min_doc_count": 0,
        "show_term_doc_count_error": false,
        "order": [
          {
            "totalSales": "desc"
          },
          {
            "_key": "asc"
          }
        ]
      },
      "aggregations": {
        "totalSales": {
          "sum": {
            "field": "salesEstimate"
          }
        }
      }
    }
  }
}

As you may imagine, we are facing performance issues and response times are very slow, higher than 30 seconds in some cases when a brand contains a lot of products.

Mapping for the ASIN field:

"asin": {
  "type": "keyword",
  "eager_global_ordinals": true
}

Each weekly index is around 10 GB in size and has 1 primary shard and 1 replica shard.

The cluster has 3 master nodes and 4 data nodes, with 4 CPUs and 32 GB of RAM each.

Is there anything we can do here, either in cluster configuration or in the way the query is built, or is poor performance expected in this case, given the number of documents and unique ASINs we store?

Thank you in advance.

Hi Ivan,
I have the exact same issue.

The field type for our field is keyword, so doc_values is enabled, and eager_global_ordinals is set to true.

For queries returning more than 1 or 2 million hits, the aggregation is very slow to compute. Profiling reports that GlobalOrdinalsStringTermsAggregator spends more than 99% of its time in the collect phase.
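
In case it is useful, this is roughly how such a profile can be gathered; a minimal sketch, where the index name is a placeholder and the aggregation is trimmed down to just the terms aggregation being profiled:

# Sketch only: "my-index" is a placeholder; the timing breakdown, including
# GlobalOrdinalsStringTermsAggregator, is returned under "profile" in the response.
POST /my-index/_search
{
  "size": 0,
  "profile": true,
  "aggregations": {
    "asins": {
      "terms": { "field": "asin", "size": 1000 }
    }
  }
}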

Our Elasticsearch cluster runs in a Kubernetes environment with 8 data nodes, each with 8 vCPUs and 32 GB RAM (8 GB heap). We don't see disk I/O on the Elasticsearch nodes, as the off-heap memory is large and index caching is therefore efficient.

A possible solution is to wrap this aggregation in a sampler aggregation, but that is not really acceptable for us.
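
For reference, a rough sketch of what that would look like; the index name and shard_size are illustrative values only, and the inner aggregations mirror the ones from the original post:

# Sketch only: "my-index" and shard_size are illustrative, not recommendations.
POST /my-index/_search
{
  "size": 0,
  "aggregations": {
    "sample": {
      "sampler": { "shard_size": 200000 },
      "aggregations": {
        "asins": {
          "terms": {
            "field": "asin",
            "size": 1000,
            "order": [ { "totalSales": "desc" }, { "_key": "asc" } ]
          },
          "aggregations": {
            "totalSales": { "sum": { "field": "salesEstimate" } }
          }
        }
      }
    }
  }
}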

Did you find a solution to this issue?

Regards

Dominique

Note: 1 shard means only one core processes the request, so in general use more shards and check that all cores are utilized during the request.
Then, use the date to slice your indices (IIRC there are some built-in facilities to do so), i.e. the date filter can be implemented (partially) by picking the right index: /sales-2024-08/_search.
The cluster is a bit modest by today's standards; if you manage to fully utilize it, scale out.
Also, this is a data cube problem, so Spark or Pinot may suit it better.
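
For illustration, something along these lines; the template and index names, the index pattern, and the shard count are examples only and need to be adapted to your data volumes:

# Sketch only: names, pattern and shard count are examples, not tuned values.
PUT _index_template/sales-weekly
{
  "index_patterns": ["sales-*"],
  "template": {
    "settings": {
      "number_of_shards": 4,
      "number_of_replicas": 1
    }
  }
}

# Then query only the weekly indices covering the requested period,
# instead of filtering a broad index pattern by date:
POST /sales-2024-08-25,sales-2024-09-01,sales-2024-09-08,sales-2024-09-15/_search
{
  "size": 0,
  "aggregations": {
    "asins": { "terms": { "field": "asin", "size": 1000 } }
  }
}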

Hi,

More details about the index: it contains 13 million documents and is split into 32 shards (so 4 shards per Elasticsearch node). Each shard contains about 400,000 docs and has a size of 12 GB.
Dominique

My use case is that I have an index with 45 million documents, a size of 17 GB, and only 1 shard.
There is one such index for every week for the last 2 years, and we request aggregations for periods of between 4 and 52 weeks.

No, not yet; we are still seeing poor performance for time ranges longer than 4 weeks.

If it is acceptable, maybe you can use sampler aggregations for high-cardinality fields.