Large composite agg + sorting

joropito · October 15, 2020, 3:17am

Composite aggregation doesn't work very well with the bucket_sort aggregation.
I know this has been discussed and explained many times.

My scenario uses a composite aggregation on a few fields, with some metrics inside (max, min, avg), and then I need to sort on some of those metrics while paginating.
The problem is that I can have more than 50k results, so I have to paginate (using after).

So my question is: what would be the best solution to achieve this?
Transforms? Rollups? A large result size in one call?

Hendrik_Muhs · October 15, 2020, 6:34am

Both rollup and transform are built on top of composite aggregations. So technically all 3 solutions (rollup, transform, custom solution based on composite aggs) are very similar when it comes to the query side.

Rollup and transform persist the result in a secondary index, this has the benefit of doing computations offline and usually results in a speedup in the user application as your query to the secondary index should be faster than querying the source index. Of course this depends on how often you want to run your queries. Rollup has the benefit of combining search for the rolled up index and the source at the same time (rollup search). So it basically speeds up search using the compacted index. Rollup is built around the compaction use case, the idea is to compact the source index and free up the space eventually.

The second reason to use something like rollup or transform is analysis on top of the secondary index; think of it as aggregations running on aggregations. E.g. you have an index of events and want to find the average duration of sessions. You first need to build sessions from the events and, as a second step, run an aggregation on the sessions. This is conceptually like pipeline aggregations, but it works on large data sets, where pipeline aggs run into limitations. For analysis use cases like this, transform provides more freedom than rollup.

Without knowing your use case it seems to me using rollup or transform could help you, as you could run your higher level query on top of the rolled up or transformed index.

If you share some more details, maybe with some example data, I might be able to answer in more detail. Also interesting: data size, volume of incoming new data, estimate on how often you want to run this query, etc.

joropito · October 15, 2020, 10:28pm

Thanks for your response Hendrik.

My data is not that large: about 15MM documents in total (with daily updates and additions), but the aggregations run over about 50k documents (after the query).

Currently I'm just doing a composite aggregation with 5 source fields (terms) to simulate a GROUP BY on those fields, just to get unique items, and paginating with "after" 25 items at a time.

The problem is that I want to be able to sort on other fields, including some bucket_script fields (not the source fields of the composite). The sort works within each page but not globally.

I understand technically why this happens, so I want to look at other ways to achieve it.
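
To illustrate the per-page versus global behaviour, here is a minimal plain-Python sketch, not an Elasticsearch call. The bucket keys and metric values are made up, and sort_by_metric is only a stand-in for what bucket_sort can see in a single response.

```python
from itertools import islice

# Hypothetical buckets: (composite key, metric value). A composite
# aggregation returns buckets ordered by key, not by metric.
buckets = [("a", 5), ("b", 90), ("c", 12), ("d", 70), ("e", 1), ("f", 40)]


def composite_pages(buckets, size):
    """Yield pages in key order, the way "after" walks a composite agg."""
    it = iter(sorted(buckets, key=lambda b: b[0]))
    while page := list(islice(it, size)):
        yield page


def sort_by_metric(page):
    """Stand-in for bucket_sort: it only sees the buckets it is given."""
    return sorted(page, key=lambda b: b[1], reverse=True)


# Sorting each page separately, then concatenating the pages.
per_page = [b for page in composite_pages(buckets, 2) for b in sort_by_metric(page)]

# Sorting all buckets at once, which is what the UI actually needs.
global_sort = sort_by_metric(buckets)

print(per_page)
print(global_sort)
```

Each page is sorted correctly on its own, but the concatenated pages are not in global metric order, because the page boundaries were decided by key order before the sort ran.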

joropito · October 15, 2020, 10:29pm

Just to add.
I just need to get a dataset of unique items (group by few fields) and paginate/scroll over that.

Hendrik_Muhs · October 16, 2020, 6:37am

This sounds like a transform use case to me, because for rollup you need at least one date_histogram, but you have only terms. With a transform you can build an entity-centric index around your data. Your sorting requirements can be solved by sorting the search results when you query the transform index.
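
As a rough sketch of what "sorting the search results" could look like, the snippet below builds a _search request body for the transform's destination index. The field names amount_sum and buyer_id are assumptions for illustration; nothing in the thread so far names the real fields. It uses search_after with a tiebreaker field rather than from/size, since from + size is capped by index.max_result_window (10,000 by default) and the result set here can exceed 50k.

```python
import json


def sorted_page_request(sort_field, tiebreaker, page_size=25, search_after=None):
    """Build a _search body for one page of the (hypothetical) transform index.

    sort_field   -- metric to sort on, e.g. an aggregated value
    tiebreaker   -- second sort field so that the ordering is stable
    search_after -- sort values of the last hit on the previous page
    """
    body = {
        "size": page_size,
        "sort": [
            {sort_field: {"order": "desc"}},
            {tiebreaker: {"order": "asc"}},
        ],
    }
    if search_after is not None:
        body["search_after"] = search_after
    return body


# First page, then the next page starting after the last hit's sort values.
first_page = sorted_page_request("amount_sum", "buyer_id")
next_page = sorted_page_request("amount_sum", "buyer_id", search_after=[119.0, 1])

print(json.dumps(next_page, indent=2))
```

Because every document in the transform index is already one "bucket", the sort is global and paging no longer depends on the composite key order.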

joropito · October 16, 2020, 12:22pm

Yes, and I'm running some tests on that.

The only caveat is that I have to group by year but also offer an "all years" option, so I have to create 2 transforms and handle that situation on the app side.
That's because I have some avg fields, and for "all years" I can't use another aggregation on the transformed index (because of the paging/sort issues again).

Hendrik_Muhs · October 16, 2020, 2:36pm

Why don't you use a date_histogram group_by in addition to terms? You can have as many group_by's as you want. Combining the 2 or more years with an aggregation on the transform index is simple.
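
A minimal sketch of such a pivot, written as a Python dict that could be sent as the body of PUT _transform/<transform_id>. The index names are hypothetical, and the field names (buyer_id, seller_id, date, amount) are borrowed from the example data posted later in this thread. Storing a sum and a count instead of an avg is an assumption on my side: it lets an average over several years be recomputed as sum / count.

```python
import json

transform = {
    "source": {"index": "purchases"},        # hypothetical source index
    "dest": {"index": "purchases_by_year"},  # hypothetical destination index
    "pivot": {
        "group_by": {
            # terms group_bys play the role of the composite "terms" sources
            "buyer_id": {"terms": {"field": "buyer_id"}},
            "seller_id": {"terms": {"field": "seller_id"}},
            # one bucket per calendar year
            "year": {
                "date_histogram": {"field": "date", "calendar_interval": "1y"}
            },
        },
        "aggregations": {
            "amount_sum": {"sum": {"field": "amount"}},
            "amount_count": {"value_count": {"field": "amount"}},
        },
    },
}

print(json.dumps(transform, indent=2))
```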

I think it would really help the discussion if you could provide some example data and the output you are looking for. No need to leak any internal information; simply mask/abstract the data for the purpose of this discussion.

joropito · October 16, 2020, 4:19pm

As an example, this could be the original data:

{
  "buyer_id": 1,
  "seller_id": 500,
  "date": "2020-05-01",
  "amount": 75
},
{
  "buyer_id": 2,
  "seller_id": 500,
  "date": "2020-03-04",
  "amount": 34
},
{
  "buyer_id": 1,
  "seller_id": 500,
  "date": "2019-03-05",
  "amount": 45
},
{
  "buyer_id": 1,
  "seller_id": 500,
  "date": "2019-05-01",
  "amount": 56
},
{
  "buyer_id": 1,
  "seller_id": 500,
  "date": "2020-03-01",
  "amount": 44
}

And this is the expected output:

buyer_id	seller_id	date (year)	amount.sum
1	500	2020-01-01	119
2	500	2020-01-01	34
1	500	2019-01-01	101
1	500	ALL	220

The ALL row is there because I need to browse the results by YEAR or ALL YEARS but always having the sum/avg etc for the whole selected period.

So I need the sum of 2019+2020 without losing pagination. That's why I say I need 2 transforms: one grouped by year and the other for the last option (ALL).
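
For reference, the expected output can be reproduced from the example data with a few lines of plain Python. This is only a sketch of the grouping logic, not of how a transform computes it; the rollup function name is made up. It emits an ALL row for every buyer/seller pair, so buyer 2 also gets one, which the table above does not list.

```python
from collections import defaultdict

docs = [
    {"buyer_id": 1, "seller_id": 500, "date": "2020-05-01", "amount": 75},
    {"buyer_id": 2, "seller_id": 500, "date": "2020-03-04", "amount": 34},
    {"buyer_id": 1, "seller_id": 500, "date": "2019-03-05", "amount": 45},
    {"buyer_id": 1, "seller_id": 500, "date": "2019-05-01", "amount": 56},
    {"buyer_id": 1, "seller_id": 500, "date": "2020-03-01", "amount": 44},
]


def rollup(docs):
    """Sum amount per (buyer, seller, year) and per (buyer, seller, "ALL")."""
    sums = defaultdict(int)
    for d in docs:
        pair = (d["buyer_id"], d["seller_id"])
        year = d["date"][:4] + "-01-01"  # truncate the date to the year start
        sums[pair + (year,)] += d["amount"]
        sums[pair + ("ALL",)] += d["amount"]
    return dict(sums)


rows = rollup(docs)

# With every row materialised, a global sort followed by slicing gives
# correct pages for either selection (one year or ALL).
for (buyer, seller, period), total in sorted(rows.items(), key=lambda kv: -kv[1]):
    print(buyer, seller, period, total)
```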
