Hello!
I'm sharing an issue I'm running into here, hoping someone can help me.
We are running a terms aggregation on a field with very high cardinality: about 40 million unique values, representing product identifiers (ASINs).
For each of those product identifiers we have weekly data that we need to aggregate by different criteria, such as brand, category, time range, and so on. In other words, we index roughly 40 million documents per week.
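To give an idea of the document shape, a simplified weekly document looks roughly like this (the values are invented, and only the fields relevant to this question are shown):

{
  "asin": "B0EXAMPLE01",
  "brand": "Nike",
  "category": "Shoes",
  "dateFrom": "2024-08-25",
  "salesEstimate": 1234.5
}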
An example of the query follows:
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "dateFrom": {
              "from": "2024-08-25",
              "to": "2024-09-21",
              "include_lower": true,
              "include_upper": false,
              "boost": 1
            }
          }
        },
        {
          "terms": {
            "brand.keyword": [
              "Nike"
            ],
            "boost": 1
          }
        }
      ],
      "adjust_pure_negative": true,
      "boost": 1
    }
  },
  "_source": false,
  "track_total_hits": -1,
  "aggregations": {
    "asins": {
      "terms": {
        "field": "asin",
        "size": 1000,
        "min_doc_count": 1,
        "shard_min_doc_count": 0,
        "show_term_doc_count_error": false,
        "order": [
          {
            "totalSales": "desc"
          },
          {
            "_key": "asc"
          }
        ]
      },
      "aggregations": {
        "totalSales": {
          "sum": {
            "field": "salesEstimate"
          }
        }
      }
    }
  }
}
As you may imagine, we are facing performance issues: response times are very slow, exceeding 30 seconds in some cases when a brand contains many products.
Mapping for the ASIN field:
"asin": {
  "type": "keyword",
  "eager_global_ordinals": true
}
Each weekly index is around 10 GB in size and has 1 primary shard and 1 replica shard.
The cluster has 3 master nodes and 4 data nodes, each with 4 CPUs and 32 GB of RAM.
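For reference, each weekly index is created with settings along these lines (simplified to just the shard and replica counts mentioned above):

{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}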
Is there anything we can do here, either in terms of cluster configuration or the way the query is built, or is poor performance expected in this case, considering the number of documents and unique ASINs that we store?
Thank you in advance.