Bucket Sort over Composite aggregation degrading the performance of query - Any better way?

We have a use case where we need to get the top 20 items, which has max sum of revenues.
Lets consider the mapping below:

PUT /my-index/_mapping
{
   "properties": {
        "item": {
                "type": "keyword"
         },
        "type": {
                "type": "keyword",
         },
        "cat-id": {
                "type": "long",
         },
        "revenue": {
               "type": "long",
               "null_value": 0
         }
 }

There could be multiple entries for same item, and goal is to get the top 20 items which has highest revenue sum.
I am using something as similar below -

GET /my-index/_search
{
  "query": {
    "bool": {
      "must": [
                {
                   "term": {
                      "item": {
                      "value": "4044",
                      "boost": 1
                       }
                    }
                 }
       ]
     }
   },
"aggregations": {
   "agg": {
      "composite": {
       "size": 78000000,
         "sources": [
            {
              "query": {
                  "terms": {
                     "field": "item",
                    "order": "asc"
                  }
                }
             }
           ]
        },
       "aggregations": {
        "tot_revenue": {
           "sum": {
              "field": "revenue"
               }
             }
          },
        "sortdata": {
            "bucket_sort": {
                "sort": [
                               {
                                   "tot_revenue": {
                                          "order": "desc"
                                           }
                               }
                         ],
            "from": 0,
            "size": 20,
           "gap_policy": "SKIP"
         }
      }
    }
  }
}

Here each bucket is most likely to have max of 1 to 2 items, but the number of buckets will be really high. This query runs into multiple seconds (8 to 10 secs for almost 3000000 buckets), for large doc count, which I understand is coz.. to sort the aggregated value from each bucket elastic search eventually has to scan all of them.

I was wondering if there is any better way to query this, as such queries are destabilising our cluster when fired in bulk.
I believe Index sort will not much of the use here as sort key is an aggregated value.

I recently read about Filtering values with partition , But I believe its of not much help in the above mentioned scenarios, as even if I split the buckets in partitions, for each partition we will end up having size as number of items in each partition. And then again Client will have to choose top 20 from responses..

Hi Santosh,

goal is to get the top 20 items which has highest revenue sum.

Unless I've overlooked something in your question this should just be a plain terms aggregation ordered by a child sum aggregation:

GET /my-index/_search
{
  "size": 0,
  "aggregations": {
	"agg": {
	  "terms": {
		"size": 20,
		"field": "item",
		"order": {
		  "tot_revenue": "desc"
		}
	  },
	  "aggregations": {
		"tot_revenue": {
		  "sum": {
			"field": "revenue"
		  }
		}
	  }
	}
  }
}

Keep an eye on the doc_count_error_upper_bound field in the results. If this is >0 then consider increasing the shard_size value in the terms aggregation.

Hi Mark,

Thanks for the reply.
The item as I mentioned in my earlier post is not unique. So there will be multiple buckets created item.
While trying out the suggested solution I got following exception :

"root_cause": [
  {
    "type": "aggregation_execution_exception",
    "reason": "Invalid aggregation order path [tot_revenue]. Buckets can only be sorted on a sub-aggregator path that is built out of zero or more single-bucket aggregations within the path and a final single-bucket or a metrics aggregation at the path end."
  }
]

It seems my query is slow because it has to iterate thru all the buckets and then sort them by tot_revenue.

Can you post a cut-down version of the following JSON that recreates the problem:

  1. Your mapping
  2. A handful of example docs
  3. Your query.

Thanks

Hi Mark,

Please find the details below:

Mappings:

{
  "usgm_query_2019-08-27": {
    "aliases": {},
    "mappings": {
      "default": {
        "dynamic": "false",
        "properties": {
          "atc": {
            "type": "long",
            "null_value": 0
          },
          "cat_id": {
            "type": "long"
          },
          "category": {
            "type": "long"
          },
          "clicks": {
            "type": "long",
            "null_value": 0
          },
          "conversions": {
            "type": "long",
            "null_value": 0
          },
          "data_date": {
            "type": "date",
            "format": "yyyy-MM-dd"
          },
          "dept_id": {
            "type": "long"
          },
          "item_impressions": {
            "type": "long",
            "null_value": 0
          },
          "nested_subcat_id1": {
            "type": "long"
          },
          "nested_subcat_id2": {
            "type": "long"
          },
          "normalized_query": {
            "type": "keyword",
            "fields": {
              "like": {
                "type": "text"
              }
            },
            "normalizer": "my_norm"
          },
          "oos": {
            "type": "long",
            "null_value": 0
          },
          "page1_atc": {
            "type": "long",
            "null_value": 0
          },
          "page1_clicks": {
            "type": "long",
            "null_value": 0
          },
          "page1_conversions": {
            "type": "long",
            "null_value": 0
          },
          "page1_item_impressions": {
            "type": "long",
            "null_value": 0
          },
          "page1_oos": {
            "type": "long",
            "null_value": 0
          },
          "page1_page_impressions": {
            "type": "long",
            "null_value": 0
          },
          "page1_revenue": {
            "type": "double",
            "null_value": 0
          },
          "page1_uniq_atc": {
            "type": "long",
            "null_value": 0
          },
          "page1_uniq_clicks": {
            "type": "long",
            "null_value": 0
          },
          "page1_uniq_conversions": {
            "type": "long",
            "null_value": 0
          },
          "page1_uniq_oos": {
            "type": "long",
            "null_value": 0
          },
          "page1_uniq_page_impressions": {
            "type": "long",
            "null_value": 0
          },
          "page_impressions": {
            "type": "long",
            "null_value": 0
          },
          "page_type": {
            "type": "keyword",
            "null_value": "NULL",
            "normalizer": "my_norm"
          },
          "query": {
            "type": "keyword",
            "fields": {
              "like": {
                "type": "text"
              }
            },
            "normalizer": "my_norm"
          },
          "revenue": {
            "type": "double",
            "null_value": 0
          },
          "site": {
            "type": "keyword",
            "null_value": "NULL",
            "normalizer": "my_norm"
          },
          "subcat_id": {
            "type": "long"
          },
          "tenant": {
            "type": "keyword",
            "null_value": "NULL",
            "normalizer": "my_norm"
          },
          "uniq_atc": {
            "type": "long",
            "null_value": 0
          },
          "uniq_clicks": {
            "type": "long",
            "null_value": 0
          },
          "uniq_conversions": {
            "type": "long",
            "null_value": 0
          },
          "uniq_oos": {
            "type": "long",
            "null_value": 0
          },
          "uniq_page_impressions": {
            "type": "long",
            "null_value": 0
          }
        }
      }
    },
    "settings": {
      "index": {
        "mapping": {
          "coerce": "false"
        },
        "number_of_shards": "1",
        "provided_name": "usgm_query_2019-08-27",
        "creation_date": "1566878778880",
        "unassigned": {
          "node_left": {
            "delayed_timeout": "5m"
          }
        },
        "analysis": {
          "normalizer": {
            "my_norm": {
              "filter": [
                "lowercase",
                "asciifolding"
              ],
              "type": "custom"
            }
          }
        },
        "number_of_replicas": "2",
        "uuid": "j2y1Hi1AScOGtV_gTjw9FA",
        "version": {
          "created": "6010299"
        }
      }
    }
  }
}

Example Docs:

{
"took": 21,
"timed_out": false,
"_shards": {
  "total": 1,
  "successful": 1,
  "skipped": 0,
  "failed": 0
},
"hits": {
  "total": 2705760,
  "max_score": 1,
  "hits": [
    {
      "_index": "usgm_query_2019-08-26",
      "_type": "default",
      "_id": "kRm-22wBUf0sDgpkTtO3",
      "_score": 1,
      "_source": {
        "page1_clicks": 0,
        "page1_uniq_clicks": 0,
        "conversions": 0,
        "page1_page_impressions": 2,
        "page_type": "search",
        "item_impressions": 0,
        "page1_revenue": 0,
        "page1_oos": 0,
        "page1_uniq_oos": 0,
        "page1_atc": 0,
        "atc": 0,
        "page1_uniq_atc": 0,
        "oos": 0,
        "revenue": 0,
        "uniq_conversions": 0,
        "cat_id": 0,
        "page1_conversions": 0,
        "tenant": "usgm",
        "page1_item_impressions": 0,
        "subcat_id": 0,
        "uniq_clicks": 0,
        "query": "ollieoversizedtop-1x-peach",
        "page_impressions": 2,
        "uniq_page_impressions": 0,
        "data_date": "2019-08-26",
        "nested_subcat_id1": 0,
        "page1_uniq_conversions": 0,
        "nested_subcat_id2": 0,
        "site": "uswm_en",
        "page1_uniq_page_impressions": 0,
        "normalized_query": "ollieoversizedtop 1x peach",
        "clicks": 0,
        "uniq_atc": 0,
        "dept_id": 0,
        "uniq_oos": 0
      }
    },
    {
      "_index": "usgm_query_2019-08-26",
      "_type": "default",
      "_id": "BUi-22wB0DCa4RSgTde_",
      "_score": 1,
      "_source": {
        "page1_clicks": 0,
        "page1_uniq_clicks": 0,
        "conversions": 0,
        "page1_page_impressions": 1,
        "page_type": "search",
        "item_impressions": 40,
        "page1_revenue": 0,
        "page1_oos": 0,
        "page1_uniq_oos": 0,
        "page1_atc": 0,
        "atc": 0,
        "page1_uniq_atc": 0,
        "oos": 0,
        "revenue": 0,
        "uniq_conversions": 0,
        "cat_id": 623679,
        "page1_conversions": 0,
        "tenant": "usgm",
        "page1_item_impressions": 40,
        "subcat_id": 0,
        "uniq_clicks": 0,
        "query": "yellow and white checkered tablecloth",
        "page_impressions": 1,
        "uniq_page_impressions": 1,
        "data_date": "2019-08-26",
        "nested_subcat_id1": 0,
        "page1_uniq_conversions": 0,
        "nested_subcat_id2": 0,
        "site": "uswm_en",
        "page1_uniq_page_impressions": 1,
        "normalized_query": "yellow and white checker tablecloth",
        "clicks": 0,
        "uniq_atc": 0,
        "dept_id": 4044,
        "uniq_oos": 0
      }
    },
    {
      "_index": "usgm_query_2019-08-26",
      "_type": "default",
      "_id": "khm-22wBUf0sDgpkTtO3",
      "_score": 1,
      "_source": {
        "page1_clicks": 0,
        "page1_uniq_clicks": 0,
        "conversions": 0,
        "page1_page_impressions": 1,
        "page_type": "search",
        "item_impressions": 60,
        "page1_revenue": 0,
        "page1_oos": 0,
        "page1_uniq_oos": 0,
        "page1_atc": 0,
        "atc": 0,
        "page1_uniq_atc": 0,
        "oos": 0,
        "revenue": 0,
        "uniq_conversions": 0,
        "cat_id": 0,
        "page1_conversions": 0,
        "tenant": "usgm",
        "page1_item_impressions": 20,
        "subcat_id": 0,
        "uniq_clicks": 0,
        "query": "pointless",
        "page_impressions": 3,
        "uniq_page_impressions": 3,
        "data_date": "2019-08-26",
        "nested_subcat_id1": 0,
        "page1_uniq_conversions": 0,
        "nested_subcat_id2": 0,
        "site": "uswm_en",
        "page1_uniq_page_impressions": 1,
        "normalized_query": "pointless",
        "clicks": 0,
        "uniq_atc": 0,
        "dept_id": 0,
        "uniq_oos": 0
      }
    },
    {
      "_index": "usgm_query_2019-08-26",
      "_type": "default",
      "_id": "kxm-22wBUf0sDgpkTtO3",
      "_score": 1,
      "_source": {
        "page1_clicks": 0,
        "page1_uniq_clicks": 0,
        "conversions": 0,
        "page1_page_impressions": 2,
        "page_type": "search",
        "item_impressions": 0,
        "page1_revenue": 0,
        "page1_oos": 0,
        "page1_uniq_oos": 0,
        "page1_atc": 0,
        "atc": 0,
        "page1_uniq_atc": 0,
        "oos": 0,
        "revenue": 0,
        "uniq_conversions": 0,
        "cat_id": 0,
        "page1_conversions": 0,
        "tenant": "usgm",
        "page1_item_impressions": 0,
        "subcat_id": 0,
        "uniq_clicks": 0,
        "query": "mon849pink3x5",
        "page_impressions": 2,
        "uniq_page_impressions": 0,
        "data_date": "2019-08-26",
        "nested_subcat_id1": 0,
        "page1_uniq_conversions": 0,
        "nested_subcat_id2": 0,
        "site": "uswm_en",
        "page1_uniq_page_impressions": 0,
        "normalized_query": "mon849pink3x5",
        "clicks": 0,
        "uniq_atc": 0,
        "dept_id": 0,
        "uniq_oos": 0
      }
    }
    }
    }
  ]
}

}

Query that I am using:

{
"query": {
  "bool": {
    "must": [
      {
        "bool": {
          "should": [
            {
              "term": {
                "dept_id": {
                  "value": "4044",
                  "boost": 1
                }
              }
            },
            {
              "term": {
                "nested_subcat_id1": {
                  "value": "3206242",
                  "boost": 1
                }
              }
            },
            {
              "term": {
                "subcat_id": {
                  "value": "6182459",
                  "boost": 1
                }
              }
            },
            {
              "term": {
                "nested_subcat_id1": {
                  "value": "4818720",
                  "boost": 1
                }
              }
            }
          ],
          "adjust_pure_negative": true,
          "boost": 1
        }
      },
      {
        "term": {
          "page_type": {
            "value": "search",
            "boost": 1
          }
        }
      }
    ],
    "adjust_pure_negative": true,
    "boost": 1
  }
},
"aggregations": {
  "agg": {
    "composite": {
      "size": 78000000,
      "sources": [
        {
          "query": {
            "terms": {
              "field": "query",
              "order": "asc"
            }
          }
        }
      ]
    },
    "aggregations": {
      "atc": {
        "sum": {
          "field": "atc"
        }
      },
      "conversions": {
        "sum": {
          "field": "conversions"
        }
      },
      "revenue": {
        "sum": {
          "field": "revenue"
        }
      },
      "clicks": {
        "sum": {
          "field": "clicks"
        }
      },
      "page_impressions": {
        "sum": {
          "field": "page_impressions"
        }
      },
      "pvr_rate": {
        "bucket_script": {
          "buckets_path": {
            "clicks": "clicks",
            "page_impressions": "page_impressions"
          },
          "script": {
            "source": "(params.clicks/params.page_impressions)*100",
            "lang": "painless"
          },
          "gap_policy": "skip"
        }
      },
      "atc_rate": {
        "bucket_script": {
          "buckets_path": {
            "atc": "atc",
            "page_impressions": "page_impressions"
          },
          "script": {
            "source": "(params.atc/params.page_impressions)*100",
            "lang": "painless"
          },
          "gap_policy": "skip"
        }
      },
      "conversion_rate": {
        "bucket_script": {
          "buckets_path": {
            "conversions": "conversions",
            "page_impressions": "page_impressions"
          },
          "script": {
            "source": "(params.conversions/params.page_impressions)*100",
            "lang": "painless"
          },
          "gap_policy": "skip"
        }
      },
      "sortdata": {
        "bucket_sort": {
          "sort": [
            {
              "page_impressions": {
                "order": "desc"
              }
            }
          ],
          "from": 0,
          "size": 20,
          "gap_policy": "SKIP"
        }
      }
    }
  }
}
}

This is kind of legacy that i got to work with, But am trying to improve the performance here, without much of change. We have daily indices, and each index has around 2415887 docs.

Really appreciate your help.

Thanks
Santosh

Thanks but that query seems different to your question (sorting by page impressions vs revenue?)

If you can distill the examples down to the bare minimum set of fields required to demonstrate the problem we'll have a better start point.

Actually I tried to provide simplified version of this in my first post. And hence had renamed few fields.
The problem in short is I need to group by query, and aggregate few values, page-impression being one of them.. and finally choose top 20 queries, order by page-impression.

OK. Repeating my answer then with a renamed choice of fields:

GET /test/_search
{
  "size": 0,
  "aggregations": {
	"agg": {
	  "terms": {
		"size": 20,
		"field": "query",
		"order": {
		  "sum_page_impressions": "desc"
		}
	  },
	  "aggregations": {
		"sum_page_impressions": {
		  "sum": {
			"field": "page_impressions"
		  }
		}
	  }
	}
  }
}

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.