Hi, I'm looking for some advice on the best way to implement an aggregation query that supports pagination and sorting.
Quick Overview of My Documents & Desired Use Case:
I have indexes that contain documents with a number of relevant fields:
{ user_id: 5, timestamp: 123456789, ... other fields ... }
{ user_id: 6, timestamp: 123456791, ... other fields ... }
{ user_id: 8, timestamp: 123456811, ... other fields ... }
{ user_id: 5, timestamp: 1234567891, ... other fields ... }
I'm trying to implement an aggregation query that returns a list of user_ids sorted by their most recent timestamp. Ideally, the aggregation will support pagination, since user_id is a high-cardinality field and many documents may share the same user_id. I want to avoid pulling all of the results at once because of speed and memory concerns.
For example, say I only want 1 user_id per page of results, using the example documents above:
Page 1 -> user_id = 5
Page 2 -> user_id = 8
Page 3 -> user_id = 6
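To make the desired behaviour concrete, here is a minimal Python sketch that computes these pages from the four sample documents in application memory. It only illustrates the output I expect, not the Elasticsearch query I'm after; the names `docs`, `latest_per_user`, and `page` are made up for this example.

```python
# Sample documents from above (other fields omitted).
docs = [
    {"user_id": 5, "timestamp": 123456789},
    {"user_id": 6, "timestamp": 123456791},
    {"user_id": 8, "timestamp": 123456811},
    {"user_id": 5, "timestamp": 1234567891},
]


def latest_per_user(documents):
    """Return (user_id, max timestamp) pairs, most recent first."""
    latest = {}
    for doc in documents:
        uid, ts = doc["user_id"], doc["timestamp"]
        if uid not in latest or ts > latest[uid]:
            latest[uid] = ts
    return sorted(latest.items(), key=lambda item: item[1], reverse=True)


def page(documents, page_number, page_size=1):
    """Return the user_ids on a 1-based page of the sorted result."""
    ranked = latest_per_user(documents)
    start = (page_number - 1) * page_size
    return [uid for uid, _ in ranked[start:start + page_size]]


for n in (1, 2, 3):
    print(f"Page {n} -> user_ids = {page(docs, n)}")
```

This is exactly the "query everything and sort in the application" approach I'd like to avoid on a large index.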
What I've Tried so Far
- Terms Aggregation
- Composite Aggregation
Terms aggregations have good support for sorting the result buckets, and they avoid duplicate user_id results by grouping all documents with the same user_id into one bucket:
"users": {
  "terms": {
    "field": "user_id",
    "order": { "user_latest_timestamp.max": "desc" },
    "size": 200
  },
  "aggs": {
    "user_latest_timestamp": {
      "stats": { "field": "timestamp" }
    }
  }
}
The problem is that, as far as I can tell, terms aggregations do not support pagination. If the index I'm running this query against contains 5,000+ unique user_ids, then size must be set to some arbitrarily large value (e.g. 5,000) to cover them all. The outcome is a slow query that returns 100s of KBs of data.
Composite aggregations are obviously the way to go for paginating aggregation responses, but I can't seem to get the sorting to work for my use case.
"users": {
  "composite": {
    "size": 200,
    "sources": [
      { "user_ids": { "terms": { "field": "user_id" } } }
    ]
  }
}
The first composite aggregation I tried only supports sorting the results by the values of sources (i.e. the user_id values).
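For reference, paging through this aggregation works by passing the previous response's `after_key` back as `after`. Below is a rough Python sketch of that loop. It assumes a hypothetical `search(body)` callable that sends the request body to the index and returns the parsed JSON response; substitute whatever client call you actually use. The function name `iter_composite_buckets` is also just for illustration.

```python
def iter_composite_buckets(search, page_size=200):
    """Yield every bucket of the "users" composite aggregation, page by page.

    `search` is a hypothetical callable: it takes a request body (dict)
    and returns the parsed search response (dict).
    """
    composite = {
        "size": page_size,
        "sources": [{"user_ids": {"terms": {"field": "user_id"}}}],
    }
    while True:
        # "size": 0 skips the document hits; only the aggregation is needed.
        body = {"size": 0, "aggs": {"users": {"composite": composite}}}
        result = search(body)["aggregations"]["users"]
        yield from result["buckets"]
        after_key = result.get("after_key")
        if not result["buckets"] or after_key is None:
            break
        # Resume after the last bucket of the previous page.
        composite["after"] = after_key
```

The buckets come back ordered by user_id, page after page, which is the limitation described above: there is no way here to order them by each user's latest timestamp.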
"users": {
  "composite": {
    "size": 200,
    "sources": [
      { "time": { "terms": { "field": "timestamp", "order": "desc" } } },
      { "user_ids": { "terms": { "field": "user_id" } } }
    ]
  }
}
I've also attempted to include the timestamp in sources, but I don't think this is appropriate either. This does produce a sorted list of results in each query, but my understanding is that, because composite aggregations don't look at every document, the results are only sorted locally (i.e. each page is sorted, but that doesn't mean the first bucket I receive for the first page actually contains the most recent timestamp in the entire index).
Additionally, I receive "duplicate" user_id results: each unique timestamp + user_id pair appears in my composite buckets response, meaning I'd need to keep track of previously displayed user_ids in my application.
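The duplicate problem can be seen without a cluster. The following is a small Python simulation, not real Elasticsearch output: it builds one key per unique (timestamp, user_id) pair from the sample documents, orders the keys the way the sources above are declared (timestamp descending, then user_id), and reads off the user_id of each bucket. `composite_keys` is a name invented for the sketch.

```python
# Sample documents from the top of the post (other fields omitted).
docs = [
    {"user_id": 5, "timestamp": 123456789},
    {"user_id": 6, "timestamp": 123456791},
    {"user_id": 8, "timestamp": 123456811},
    {"user_id": 5, "timestamp": 1234567891},
]


def composite_keys(documents):
    """One key per unique (timestamp, user_id) pair: timestamp desc, then user_id asc."""
    pairs = {(d["timestamp"], d["user_id"]) for d in documents}
    return sorted(pairs, key=lambda p: (-p[0], p[1]))


# With one bucket per page, this is the user_id shown on each page.
user_id_per_page = [uid for _, uid in composite_keys(docs)]
print(user_id_per_page)
```

With the sample data, user_id 5 shows up on two different pages, once for each of its timestamps, so the application would have to remember that it has already displayed it.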
Is there any way to sort my aggregation results by timestamp and support pagination without maintaining some state in my application? I want to avoid querying every single user_id + its max timestamp and sorting in the application.
Any advice would be greatly appreciated.