We have a use case where we need to get the top 20 items with the highest sum of revenue.
Let's consider the mapping below:
PUT /my-index/_mapping
{
  "properties": {
    "item": {
      "type": "keyword"
    },
    "type": {
      "type": "keyword"
    },
    "cat-id": {
      "type": "long"
    },
    "revenue": {
      "type": "long",
      "null_value": 0
    }
  }
}
There can be multiple entries for the same item, and the goal is to get the top 20 items with the highest revenue sum.
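For illustration, the data could look like this (the field values here are made up):
POST /my-index/_bulk
{ "index": {} }
{ "item": "4044", "type": "online", "cat-id": 12, "revenue": 100 }
{ "index": {} }
{ "item": "4044", "type": "store", "cat-id": 12, "revenue": 250 }
{ "index": {} }
{ "item": "5120", "type": "online", "cat-id": 7, "revenue": 90 }
Here item 4044 should aggregate to a revenue sum of 350.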
I am using something similar to the query below -
GET /my-index/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "item": {
              "value": "4044",
              "boost": 1
            }
          }
        }
      ]
    }
  },
  "aggregations": {
    "agg": {
      "composite": {
        "size": 78000000,
        "sources": [
          {
            "item": {
              "terms": {
                "field": "item",
                "order": "asc"
              }
            }
          }
        ]
      },
      "aggregations": {
        "tot_revenue": {
          "sum": {
            "field": "revenue"
          }
        },
        "sortdata": {
          "bucket_sort": {
            "sort": [
              {
                "tot_revenue": {
                  "order": "desc"
                }
              }
            ],
            "from": 0,
            "size": 20,
            "gap_policy": "skip"
          }
        }
      }
    }
  }
}
Here each bucket is likely to hold at most 1 to 2 documents, but the number of buckets will be very high. This query takes multiple seconds (8 to 10 seconds for almost 3,000,000 buckets) at large doc counts, which I understand is because, in order to sort on the aggregated value, Elasticsearch eventually has to build and scan every bucket.
I was wondering if there is a better way to write this query, as such queries are destabilising our cluster when fired in bulk.
I believe index sorting will not be of much use here, since the sort key is an aggregated value.
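For reference, the closest alternative I have looked at is a plain terms aggregation ordered by the sum sub-aggregation, which lets each shard return only its own top candidates instead of materialising every bucket. This is just a sketch (the top_items name and the shard_size value are placeholders), and as far as I know the ordering is approximate across shards, so I am not sure the accuracy trade-off is acceptable:
GET /my-index/_search
{
  "size": 0,
  "aggregations": {
    "top_items": {
      "terms": {
        "field": "item",
        "size": 20,
        "shard_size": 1000,
        "order": {
          "tot_revenue": "desc"
        }
      },
      "aggregations": {
        "tot_revenue": {
          "sum": {
            "field": "revenue"
          }
        }
      }
    }
  }
}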