I’m looking for the fastest and safest way to retrieve all unique values of a field for a given time range, in batches.
My requirements are:

- Retrieve 100% of unique terms (no missing buckets)
- Support batching / pagination
- Be as fast as possible
- Avoid excessive heap usage
I’ve evaluated two approaches:
1. `terms` aggregation with `include.partition`

Example:

```json
"terms": {
  "field": "someField.keyword",
  "size": 10000,
  "include": {
    "partition": 0,
    "num_partitions": 20
  }
}
```

Iterating `partition` from 0 to `num_partitions - 1`.
I understand that:

- Each unique term is deterministically hashed into a partition
- Distribution may appear uneven for small cardinalities
- Larger cardinalities should distribute more evenly
- The same term always maps to the same partition across shards
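To make those properties concrete, here is a minimal Python sketch of hash-based partitioning. Elasticsearch uses its own internal hash, so the actual partition assignments will differ, but determinism and the evening-out at high cardinality behave the same way:

```python
import hashlib

def partition_for(term: str, num_partitions: int) -> int:
    """Illustrative stable hash -> partition mapping (not the hash
    Elasticsearch actually uses)."""
    digest = hashlib.md5(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

# Determinism: the same term maps to the same partition every time,
# no matter which shard or node computes it.
assert partition_for("user-42", 20) == partition_for("user-42", 20)

# Distribution: with high cardinality, partitions fill roughly evenly.
counts = [0] * 20
for i in range(10_000):
    counts[partition_for(f"term-{i}", 20)] += 1
print(min(counts), max(counts))  # both close to 10_000 / 20 = 500
```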
However, this approach has several drawbacks:

- `num_partitions` must be chosen upfront
- A hard `size` limit per partition must be managed (risk of missing terms)
- Manual orchestration of partitions
- No cursor/resume mechanism
- Potentially higher heap usage due to in-memory bucket building
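The orchestration burden looks roughly like the sketch below. `run_terms_partition` is a stub standing in for a real terms-aggregation search call; the key point is that when a partition returns exactly `size` buckets, you cannot tell whether terms were truncated:

```python
import hashlib

def stable_partition(term: str, num_partitions: int) -> int:
    # Illustrative stable hash (Elasticsearch uses its own internally).
    digest = hashlib.md5(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

def run_terms_partition(partition: int, num_partitions: int, size: int):
    # Stub for the actual search call: return this partition's buckets.
    corpus = [f"term-{i}" for i in range(95)]
    mine = sorted(t for t in corpus
                  if stable_partition(t, num_partitions) == partition)
    return mine[:size]  # a real terms agg also truncates at `size`

NUM_PARTITIONS, SIZE = 10, 50
seen = set()
for p in range(NUM_PARTITIONS):
    buckets = run_terms_partition(p, NUM_PARTITIONS, SIZE)
    if len(buckets) == SIZE:
        # Possible truncation: this partition may hold more than
        # `size` terms, so completeness is no longer guaranteed.
        raise RuntimeError(f"partition {p} may be truncated")
    seen.update(buckets)

assert len(seen) == 95  # complete only because no partition overflowed
```

There is no cursor to resume from: if the job dies at partition 7, you must either re-run partitions 0..7 or persist progress yourself.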
2. Composite aggregation with `after_key`

This seems to offer:

- Cursor-based pagination
- Unlimited buckets
- Natural batching
- Lower memory pressure
- Easy resumability
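For reference, the `after_key` loop can be sketched as below. The request body uses the field names from the example above plus an assumed `@timestamp` range filter; `search` is a stub simulating paged composite responses, and in practice would be a real client call whose returned `after_key` is fed back verbatim:

```python
def search(body):
    # Stub simulating a composite-aggregation response page.
    corpus = sorted(f"value-{i:03d}" for i in range(25))
    composite = body["aggs"]["uniques"]["composite"]
    size, after = composite["size"], composite.get("after")
    start = corpus.index(after["field"]) + 1 if after else 0
    page = corpus[start:start + size]
    agg = {"buckets": [{"key": {"field": v}} for v in page]}
    if len(page) == size:
        agg["after_key"] = {"key": page[-1], "field": page[-1]}["key"] and {"field": page[-1]}
    return {"aggregations": {"uniques": agg}}

body = {
    "size": 0,
    "query": {"range": {"@timestamp": {"gte": "now-1d"}}},
    "aggs": {"uniques": {"composite": {
        "size": 10,
        "sources": [{"field": {"terms": {"field": "someField.keyword"}}}],
    }}},
}

seen = []
while True:
    agg = search(body)["aggregations"]["uniques"]
    seen.extend(b["key"]["field"] for b in agg["buckets"])
    after = agg.get("after_key")
    if after is None:
        break  # no cursor returned: all pages consumed
    # Resume point: persisting `after` makes the extraction restartable.
    body["aggs"]["uniques"]["composite"]["after"] = after

assert len(seen) == 25  # every unique value, in order, no truncation risk
```

Because the cursor is just the last composite key, a crashed job can resume from the persisted `after` value instead of starting over.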
Question

For the general use case of retrieving all unique field values over a time range, at scale, with batching and maximum performance:
Is composite aggregation the recommended production approach over terms + partition?
Are there scenarios where terms + partition is preferable?
My primary goal is:
Fast, complete, resumable extraction of unique terms.
Thanks in advance for any guidance.