Large composite agg + sorting

Composite aggregation doesn't work very well with bucket_sort aggregation.
I know this was talked and explained lot of times.

My scenario works with a composite aggregation on few fields, some metrics inside (max, min, avg) and then I need to sort on some of those metrics using pagination.
My problem comes that I could have more than 50k results so I have to do pagination (using after).

Then my question is what would be the best solution to achieve this?
Transformations? Rollups? Large size results on on call?

Both rollup and transform are built on top of composite aggregations. So technically all 3 solutions (rollup, transform, custom solution based on composite aggs) are very similar when it comes to the query side.

Rollup and transform persist the result in a secondary index, this has the benefit of doing computations offline and usually results in a speedup in the user application as your query to the secondary index should be faster than querying the source index. Of course this depends on how often you want to run your queries. Rollup has the benefit of combining search for the rolled up index and the source at the same time (rollup search). So it basically speeds up search using the compacted index. Rollup is built around the compaction use case, the idea is to compact the source index and free up the space eventually.

The second reason to use something like rollup or transform is analysis on top of the secondary index, think of it like aggregation running on aggregations. E.g. you have an index around events and want to find the average duration of sessions. You first need to build sessions from the events and as a second step run an aggregations on the sessions. This is conceptually like pipeline aggregations, however works on large data sets, where pipeline aggs run into limitations. For analysis use cases like this, transform provides more freedom than rollup.

Without knowing your use case it seems to me using rollup or transform could help you, as you could run your higher level query on top of the rolled up or transformed index.

If you share some more details, maybe with some example data, I might be able to answer in more detail. Also interesting: data size, volume of incoming new data, estimate on how often you want to run this query, etc.

Thanks for your response Hendrik.

My data is not so much much large but is large like 15MM documents (with daily updates and adds) in total but the aggregations runs over like 50k documents (after query).

Currently I'm just doing a composite aggregation within 5 source fields (terms) to simulate a GROUP BY those fields just to get unique items and paginating with "after" each 25 items.

The problem is I want to be able to sort on other fields including some bucket_script fields (not the source fields of the composite) and it works on each page but not globally.

I technically understand why it happens and it's reasons, so I want to take a look on other ways to achieve it.

Just to add.
I just need to get a dataset of unique items (group by few fields) and paginate/scroll over that.

This sounds like a transform use case to me, because for rollup you need at least one date_histogram, but you have only terms. With a transform you can built an entity centric index around your data. Your sorting requirements can be solved by sorting the search results when you query the transform index.

Yes and I'm doing some test on that.

The only caveat is I have to group by year but with the option of "all years" so I have to do 2 transforms and handle that situation on the app side.
That's because I have some avg fields and also for "all years" can't use another aggs on the transformed index (because paging/sort issues again)

Why don't you use a date_histogram group_by in addition to terms? You can have as many group_by's as you want. Combining the 2 or more years with an aggregation on the transform index is simple.

I think it would really help the discussion, if you can provide some example data and the output you are looking for. No need to leak any internal information, simply mask/abstract the data for the purpose of this discussion.

As an example, these can be the original data

{
  "buyer_id": 1,
  "seller_id": 500,
  "date": "2020-05-01",
  "amount": 75
},
{
  "buyer_id": 2,
  "seller_id": 500,
  "date": "2020-03-04",
  "amount": 34
},{
  "buyer_id": 1,
  "seller_id": 500,
  "date": "2019-03-05",
  "amount": 45
},{
  "buyer_id": 1,
  "seller_id": 500,
  "date": "2019-05-01",
  "amount": 56
},{
  "buyer_id": 1,
  "seller_id": 500,
  "date": "2020-03-01",
  "amount": 44
}

And this the expected output:

buyer_id seller_id date (year) amount.sum
1 500 2020-01-01 119
2 500 2020-01-01 34
1 500 2019-01-01 101
1 500 ALL 220

The ALL row is there because I need to browse the results by YEAR or ALL YEARS but always having the sum/avg etc for the whole selected period.

So I need the sum of 2019+2020 without loosing pagination. That's why I say I need 2 transformations, one for each year and the other for the last option (ALL).

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.