Hi,
I want to estimate the distribution of an aggregated value, and I wonder if there is any way to run the parent aggregation over all buckets (one per thing_id) and then compute percentiles over the per-bucket sums, so that I can estimate the full histogram.
For example, I could do the following:
GET /data/_search
{
  "size": 0,
  "aggs": {
    "ActionsByThing": {
      "terms": {
        "field": "thing_id"
      },
      "aggs": {
        "NumberActions": {
          "sum": {
            "field": "nActions"
          }
        }
      }
    },
    "PercentileOfNumberActions": {
      "percentiles_bucket": {
        "buckets_path": "ActionsByThing>NumberActions",
        "percents": [ 1.0, 2.5, 5.0, 10.0, 25.0, 50.0, 75.0, 90.0, 95.0, 97.5, 99.0 ]
      }
    }
  }
}
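But that covers only the top K thing_id buckets (K being the size of the terms aggregation), and I'm looking for the overall distribution, including the tail.
Even with a very large K the result will still be biased, because the buckets that get dropped are not a random sample: with the default ordering they are the ones with the fewest documents.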
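To illustrate the bias I mean, here is a small Python simulation. It is only a sketch on synthetic data: the function names (percentile, simulate), the heavy-tailed document counts, and the 1 to 5 actions per document are all my own assumptions, and the nearest-rank percentile is not necessarily the exact rule percentiles_bucket uses.

```python
import math
import random


def percentile(values, p):
    """Nearest-rank percentile of a non-empty list, for p in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) * p / 100))
    return ordered[rank - 1]


def simulate(num_things=10_000, k=1_000, seed=0):
    """Compare the median per-thing sum over all things vs. only the top k.

    Synthetic, assumed data: each thing gets a heavy-tailed number of
    documents, and each document carries between 1 and 5 actions.
    """
    rng = random.Random(seed)
    things = []
    for _ in range(num_things):
        doc_count = int(rng.paretovariate(1.5))  # always >= 1
        total = sum(rng.randint(1, 5) for _ in range(doc_count))
        things.append((doc_count, total))

    # The full distribution: one sum per thing, tail included.
    all_sums = [total for _, total in things]

    # Mimic a terms aggregation ordered by document count, keeping k buckets.
    top_k = sorted(things, key=lambda t: t[0], reverse=True)[:k]
    top_sums = [total for _, total in top_k]

    return percentile(all_sums, 50), percentile(top_sums, 50)


if __name__ == "__main__":
    full_median, top_k_median = simulate()
    print("median over all things:", full_median)
    print("median over top K only:", top_k_median)
```

On data like this the median over the top K buckets comes out higher than the median over all things, which is the kind of distortion I want to avoid.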
Even if the cardinality of thing_id is small enough that I can afford K > |thing_id|, the response will still include all of the ActionsByThing buckets, which I really don't need; I only care about PercentileOfNumberActions.
Is there a way to tell ES, with some hint, to collect all the ActionsByThing buckets only for the pipeline aggregation and never return them in the response? Maybe that would also let ES optimize things internally and consume less memory.
Any other approach?