Bucket Sort over Composite aggregation degrading the performance of query - Any better way?

Santosh_Singh1 · September 17, 2019, 8:10am

We have a use case where we need to get the top 20 items, which has max sum of revenues.
Lets consider the mapping below:

PUT /my-index/_mapping
{
   "properties": {
        "item": {
                "type": "keyword"
         },
        "type": {
                "type": "keyword",
         },
        "cat-id": {
                "type": "long",
         },
        "revenue": {
               "type": "long",
               "null_value": 0
         }
 }

There could be multiple entries for same item, and goal is to get the top 20 items which has highest revenue sum.
I am using something as similar below -

GET /my-index/_search
{
  "query": {
    "bool": {
      "must": [
                {
                   "term": {
                      "item": {
                      "value": "4044",
                      "boost": 1
                       }
                    }
                 }
       ]
     }
   },
"aggregations": {
   "agg": {
      "composite": {
       "size": 78000000,
         "sources": [
            {
              "query": {
                  "terms": {
                     "field": "item",
                    "order": "asc"
                  }
                }
             }
           ]
        },
       "aggregations": {
        "tot_revenue": {
           "sum": {
              "field": "revenue"
               }
             }
          },
        "sortdata": {
            "bucket_sort": {
                "sort": [
                               {
                                   "tot_revenue": {
                                          "order": "desc"
                                           }
                               }
                         ],
            "from": 0,
            "size": 20,
           "gap_policy": "SKIP"
         }
      }
    }
  }
}

Here each bucket is most likely to have max of 1 to 2 items, but the number of buckets will be really high. This query runs into multiple seconds (8 to 10 secs for almost 3000000 buckets), for large doc count, which I understand is coz.. to sort the aggregated value from each bucket elastic search eventually has to scan all of them.

I was wondering if there is any better way to query this, as such queries are destabilising our cluster when fired in bulk.
I believe Index sort will not much of the use here as sort key is an aggregated value.

Santosh_Singh1 · September 17, 2019, 9:15am

I recently read about Filtering values with partition , But I believe its of not much help in the above mentioned scenarios, as even if I split the buckets in partitions, for each partition we will end up having size as number of items in each partition. And then again Client will have to choose top 20 from responses..

Mark_Harwood · September 24, 2019, 11:08am

Hi Santosh,

goal is to get the top 20 items which has highest revenue sum.

Unless I've overlooked something in your question this should just be a plain terms aggregation ordered by a child sum aggregation:

GET /my-index/_search
{
  "size": 0,
  "aggregations": {
	"agg": {
	  "terms": {
		"size": 20,
		"field": "item",
		"order": {
		  "tot_revenue": "desc"
		}
	  },
	  "aggregations": {
		"tot_revenue": {
		  "sum": {
			"field": "revenue"
		  }
		}
	  }
	}
  }
}

Keep an eye on the doc_count_error_upper_bound field in the results. If this is >0 then consider increasing the shard_size value in the terms aggregation.

Santosh_Singh1 · September 24, 2019, 1:10pm

Hi Mark,

Thanks for the reply.
The item as I mentioned in my earlier post is not unique. So there will be multiple buckets created item.
While trying out the suggested solution I got following exception :

"root_cause": [
  {
    "type": "aggregation_execution_exception",
    "reason": "Invalid aggregation order path [tot_revenue]. Buckets can only be sorted on a sub-aggregator path that is built out of zero or more single-bucket aggregations within the path and a final single-bucket or a metrics aggregation at the path end."
  }
]

It seems my query is slow because it has to iterate thru all the buckets and then sort them by tot_revenue.

Mark_Harwood · September 24, 2019, 1:13pm

Can you post a cut-down version of the following JSON that recreates the problem:

Your mapping
A handful of example docs
Your query.

Thanks

Santosh_Singh1 · September 24, 2019, 1:38pm

Hi Mark,

Please find the details below:

Mappings:

{
  "usgm_query_2019-08-27": {
    "aliases": {},
    "mappings": {
      "default": {
        "dynamic": "false",
        "properties": {
          "atc": {
            "type": "long",
            "null_value": 0
          },
          "cat_id": {
            "type": "long"
          },
          "category": {
            "type": "long"
          },
          "clicks": {
            "type": "long",
            "null_value": 0
          },
          "conversions": {
            "type": "long",
            "null_value": 0
          },
          "data_date": {
            "type": "date",
            "format": "yyyy-MM-dd"
          },
          "dept_id": {
            "type": "long"
          },
          "item_impressions": {
            "type": "long",
            "null_value": 0
          },
          "nested_subcat_id1": {
            "type": "long"
          },
          "nested_subcat_id2": {
            "type": "long"
          },
          "normalized_query": {
            "type": "keyword",
            "fields": {
              "like": {
                "type": "text"
              }
            },
            "normalizer": "my_norm"
          },
          "oos": {
            "type": "long",
            "null_value": 0
          },
          "page1_atc": {
            "type": "long",
            "null_value": 0
          },
          "page1_clicks": {
            "type": "long",
            "null_value": 0
          },
          "page1_conversions": {
            "type": "long",
            "null_value": 0
          },
          "page1_item_impressions": {
            "type": "long",
            "null_value": 0
          },
          "page1_oos": {
            "type": "long",
            "null_value": 0
          },
          "page1_page_impressions": {
            "type": "long",
            "null_value": 0
          },
          "page1_revenue": {
            "type": "double",
            "null_value": 0
          },
          "page1_uniq_atc": {
            "type": "long",
            "null_value": 0
          },
          "page1_uniq_clicks": {
            "type": "long",
            "null_value": 0
          },
          "page1_uniq_conversions": {
            "type": "long",
            "null_value": 0
          },
          "page1_uniq_oos": {
            "type": "long",
            "null_value": 0
          },
          "page1_uniq_page_impressions": {
            "type": "long",
            "null_value": 0
          },
          "page_impressions": {
            "type": "long",
            "null_value": 0
          },
          "page_type": {
            "type": "keyword",
            "null_value": "NULL",
            "normalizer": "my_norm"
          },
          "query": {
            "type": "keyword",
            "fields": {
              "like": {
                "type": "text"
              }
            },
            "normalizer": "my_norm"
          },
          "revenue": {
            "type": "double",
            "null_value": 0
          },
          "site": {
            "type": "keyword",
            "null_value": "NULL",
            "normalizer": "my_norm"
          },
          "subcat_id": {
            "type": "long"
          },
          "tenant": {
            "type": "keyword",
            "null_value": "NULL",
            "normalizer": "my_norm"
          },
          "uniq_atc": {
            "type": "long",
            "null_value": 0
          },
          "uniq_clicks": {
            "type": "long",
            "null_value": 0
          },
          "uniq_conversions": {
            "type": "long",
            "null_value": 0
          },
          "uniq_oos": {
            "type": "long",
            "null_value": 0
          },
          "uniq_page_impressions": {
            "type": "long",
            "null_value": 0
          }
        }
      }
    },
    "settings": {
      "index": {
        "mapping": {
          "coerce": "false"
        },
        "number_of_shards": "1",
        "provided_name": "usgm_query_2019-08-27",
        "creation_date": "1566878778880",
        "unassigned": {
          "node_left": {
            "delayed_timeout": "5m"
          }
        },
        "analysis": {
          "normalizer": {
            "my_norm": {
              "filter": [
                "lowercase",
                "asciifolding"
              ],
              "type": "custom"
            }
          }
        },
        "number_of_replicas": "2",
        "uuid": "j2y1Hi1AScOGtV_gTjw9FA",
        "version": {
          "created": "6010299"
        }
      }
    }
  }
}

Santosh_Singh1 · September 24, 2019, 1:40pm

Example Docs:

{
"took": 21,
"timed_out": false,
"_shards": {
  "total": 1,
  "successful": 1,
  "skipped": 0,
  "failed": 0
},
"hits": {
  "total": 2705760,
  "max_score": 1,
  "hits": [
    {
      "_index": "usgm_query_2019-08-26",
      "_type": "default",
      "_id": "kRm-22wBUf0sDgpkTtO3",
      "_score": 1,
      "_source": {
        "page1_clicks": 0,
        "page1_uniq_clicks": 0,
        "conversions": 0,
        "page1_page_impressions": 2,
        "page_type": "search",
        "item_impressions": 0,
        "page1_revenue": 0,
        "page1_oos": 0,
        "page1_uniq_oos": 0,
        "page1_atc": 0,
        "atc": 0,
        "page1_uniq_atc": 0,
        "oos": 0,
        "revenue": 0,
        "uniq_conversions": 0,
        "cat_id": 0,
        "page1_conversions": 0,
        "tenant": "usgm",
        "page1_item_impressions": 0,
        "subcat_id": 0,
        "uniq_clicks": 0,
        "query": "ollieoversizedtop-1x-peach",
        "page_impressions": 2,
        "uniq_page_impressions": 0,
        "data_date": "2019-08-26",
        "nested_subcat_id1": 0,
        "page1_uniq_conversions": 0,
        "nested_subcat_id2": 0,
        "site": "uswm_en",
        "page1_uniq_page_impressions": 0,
        "normalized_query": "ollieoversizedtop 1x peach",
        "clicks": 0,
        "uniq_atc": 0,
        "dept_id": 0,
        "uniq_oos": 0
      }
    },
    {
      "_index": "usgm_query_2019-08-26",
      "_type": "default",
      "_id": "BUi-22wB0DCa4RSgTde_",
      "_score": 1,
      "_source": {
        "page1_clicks": 0,
        "page1_uniq_clicks": 0,
        "conversions": 0,
        "page1_page_impressions": 1,
        "page_type": "search",
        "item_impressions": 40,
        "page1_revenue": 0,
        "page1_oos": 0,
        "page1_uniq_oos": 0,
        "page1_atc": 0,
        "atc": 0,
        "page1_uniq_atc": 0,
        "oos": 0,
        "revenue": 0,
        "uniq_conversions": 0,
        "cat_id": 623679,
        "page1_conversions": 0,
        "tenant": "usgm",
        "page1_item_impressions": 40,
        "subcat_id": 0,
        "uniq_clicks": 0,
        "query": "yellow and white checkered tablecloth",
        "page_impressions": 1,
        "uniq_page_impressions": 1,
        "data_date": "2019-08-26",
        "nested_subcat_id1": 0,
        "page1_uniq_conversions": 0,
        "nested_subcat_id2": 0,
        "site": "uswm_en",
        "page1_uniq_page_impressions": 1,
        "normalized_query": "yellow and white checker tablecloth",
        "clicks": 0,
        "uniq_atc": 0,
        "dept_id": 4044,
        "uniq_oos": 0
      }
    },
    {
      "_index": "usgm_query_2019-08-26",
      "_type": "default",
      "_id": "khm-22wBUf0sDgpkTtO3",
      "_score": 1,
      "_source": {
        "page1_clicks": 0,
        "page1_uniq_clicks": 0,
        "conversions": 0,
        "page1_page_impressions": 1,
        "page_type": "search",
        "item_impressions": 60,
        "page1_revenue": 0,
        "page1_oos": 0,
        "page1_uniq_oos": 0,
        "page1_atc": 0,
        "atc": 0,
        "page1_uniq_atc": 0,
        "oos": 0,
        "revenue": 0,
        "uniq_conversions": 0,
        "cat_id": 0,
        "page1_conversions": 0,
        "tenant": "usgm",
        "page1_item_impressions": 20,
        "subcat_id": 0,
        "uniq_clicks": 0,
        "query": "pointless",
        "page_impressions": 3,
        "uniq_page_impressions": 3,
        "data_date": "2019-08-26",
        "nested_subcat_id1": 0,
        "page1_uniq_conversions": 0,
        "nested_subcat_id2": 0,
        "site": "uswm_en",
        "page1_uniq_page_impressions": 1,
        "normalized_query": "pointless",
        "clicks": 0,
        "uniq_atc": 0,
        "dept_id": 0,
        "uniq_oos": 0
      }
    },
    {
      "_index": "usgm_query_2019-08-26",
      "_type": "default",
      "_id": "kxm-22wBUf0sDgpkTtO3",
      "_score": 1,
      "_source": {
        "page1_clicks": 0,
        "page1_uniq_clicks": 0,
        "conversions": 0,
        "page1_page_impressions": 2,
        "page_type": "search",
        "item_impressions": 0,
        "page1_revenue": 0,
        "page1_oos": 0,
        "page1_uniq_oos": 0,
        "page1_atc": 0,
        "atc": 0,
        "page1_uniq_atc": 0,
        "oos": 0,
        "revenue": 0,
        "uniq_conversions": 0,
        "cat_id": 0,
        "page1_conversions": 0,
        "tenant": "usgm",
        "page1_item_impressions": 0,
        "subcat_id": 0,
        "uniq_clicks": 0,
        "query": "mon849pink3x5",
        "page_impressions": 2,
        "uniq_page_impressions": 0,
        "data_date": "2019-08-26",
        "nested_subcat_id1": 0,
        "page1_uniq_conversions": 0,
        "nested_subcat_id2": 0,
        "site": "uswm_en",
        "page1_uniq_page_impressions": 0,
        "normalized_query": "mon849pink3x5",
        "clicks": 0,
        "uniq_atc": 0,
        "dept_id": 0,
        "uniq_oos": 0
      }
    }
    }
    }
  ]
}

}

Santosh_Singh1 · September 24, 2019, 1:41pm

Query that I am using:

{
"query": {
  "bool": {
    "must": [
      {
        "bool": {
          "should": [
            {
              "term": {
                "dept_id": {
                  "value": "4044",
                  "boost": 1
                }
              }
            },
            {
              "term": {
                "nested_subcat_id1": {
                  "value": "3206242",
                  "boost": 1
                }
              }
            },
            {
              "term": {
                "subcat_id": {
                  "value": "6182459",
                  "boost": 1
                }
              }
            },
            {
              "term": {
                "nested_subcat_id1": {
                  "value": "4818720",
                  "boost": 1
                }
              }
            }
          ],
          "adjust_pure_negative": true,
          "boost": 1
        }
      },
      {
        "term": {
          "page_type": {
            "value": "search",
            "boost": 1
          }
        }
      }
    ],
    "adjust_pure_negative": true,
    "boost": 1
  }
},
"aggregations": {
  "agg": {
    "composite": {
      "size": 78000000,
      "sources": [
        {
          "query": {
            "terms": {
              "field": "query",
              "order": "asc"
            }
          }
        }
      ]
    },
    "aggregations": {
      "atc": {
        "sum": {
          "field": "atc"
        }
      },
      "conversions": {
        "sum": {
          "field": "conversions"
        }
      },
      "revenue": {
        "sum": {
          "field": "revenue"
        }
      },
      "clicks": {
        "sum": {
          "field": "clicks"
        }
      },
      "page_impressions": {
        "sum": {
          "field": "page_impressions"
        }
      },
      "pvr_rate": {
        "bucket_script": {
          "buckets_path": {
            "clicks": "clicks",
            "page_impressions": "page_impressions"
          },
          "script": {
            "source": "(params.clicks/params.page_impressions)*100",
            "lang": "painless"
          },
          "gap_policy": "skip"
        }
      },
      "atc_rate": {
        "bucket_script": {
          "buckets_path": {
            "atc": "atc",
            "page_impressions": "page_impressions"
          },
          "script": {
            "source": "(params.atc/params.page_impressions)*100",
            "lang": "painless"
          },
          "gap_policy": "skip"
        }
      },
      "conversion_rate": {
        "bucket_script": {
          "buckets_path": {
            "conversions": "conversions",
            "page_impressions": "page_impressions"
          },
          "script": {
            "source": "(params.conversions/params.page_impressions)*100",
            "lang": "painless"
          },
          "gap_policy": "skip"
        }
      },
      "sortdata": {
        "bucket_sort": {
          "sort": [
            {
              "page_impressions": {
                "order": "desc"
              }
            }
          ],
          "from": 0,
          "size": 20,
          "gap_policy": "SKIP"
        }
      }
    }
  }
}
}

This is kind of legacy that i got to work with, But am trying to improve the performance here, without much of change. We have daily indices, and each index has around 2415887 docs.

Really appreciate your help.

Thanks
Santosh

Mark_Harwood · September 24, 2019, 1:57pm

Thanks but that query seems different to your question (sorting by page impressions vs revenue?)

If you can distill the examples down to the bare minimum set of fields required to demonstrate the problem we'll have a better start point.

Santosh_Singh1 · September 24, 2019, 4:35pm

Actually I tried to provide simplified version of this in my first post. And hence had renamed few fields.
The problem in short is I need to group by query, and aggregate few values, page-impression being one of them.. and finally choose top 20 queries, order by page-impression.

Mark_Harwood · September 24, 2019, 4:42pm

OK. Repeating my answer then with a renamed choice of fields:

GET /test/_search
{
  "size": 0,
  "aggregations": {
	"agg": {
	  "terms": {
		"size": 20,
		"field": "query",
		"order": {
		  "sum_page_impressions": "desc"
		}
	  },
	  "aggregations": {
		"sum_page_impressions": {
		  "sum": {
			"field": "page_impressions"
		  }
		}
	  }
	}
  }
}

system · October 22, 2019, 4:42pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Composite aggregation query with bucket_sort does not work properly Elasticsearch	1	204	August 28, 2022
Sorting results from composite aggregation Elasticsearch	14	3300	August 3, 2020
Sort by top_hits aggregation Elastic Search elastic-app-search	1	149	March 18, 2024
Composite aggregation ordered by time over entire data in database Elasticsearch	10	1562	May 11, 2020
Elasticsearch 6.2 - How to specify bucket_path with Composite Aggregations - How to specify Key? Elasticsearch	5	1233	September 2, 2018

Bucket Sort over Composite aggregation degrading the performance of query - Any better way?

Related topics