Pipeline aggregation: full histogram of an aggregation

Hi,

I want to estimate the distribution of an aggregation, and I wonder if there is any way to run the parent aggregation over all buckets (thing_id) and then compute the percentiles over the per-bucket sums, so I can estimate the full histogram.

For example, I could do the following:

GET /data/_search
{
    "size": 0,
    "aggs" : {
        "ActionsByThing" : {
            "terms" : {
                "field" : "thing_id"
            },
            "aggs": {
                "NumberActions": {
                    "sum": {
                        "field": "nActions"
                    }
                }
            }
        },
        "PercentileOfNumberActions": {
            "percentiles_bucket": {
                "buckets_path": "ActionsByThing>NumberActions",
                "percents": [ 1.0, 2.5, 5.0, 10.0, 25.0, 50.0, 75.0, 90.0, 95.0, 97.5, 99.0 ]
            }
        }
    }
}

But that only covers the top K thing_id buckets returned by the terms aggregation; I'm looking for the overall distribution, including the tail.

Even with a very large K, the result is still biased.
And even if the cardinality of thing_id is small enough that I can afford K > |thing_id|, the response will still include ActionsByThing, which I really don't need; I only care about PercentileOfNumberActions.

Is there a way to tell ES to collect all ActionsByThing buckets (one per thing_id) just for the pipeline aggregation, with some hint that ActionsByThing itself should never be returned? Maybe that would let ES optimize things internally and consume less memory.
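
The closest thing I'm aware of is response filtering with the `filter_path` query parameter, which only trims what is sent back. Below is a minimal sketch, assuming a terms `size` of 10000 (a made-up value that would have to exceed the number of distinct thing_id values) and a shortened `percents` list. As far as I understand, ES still builds every ActionsByThing bucket internally, so this would shrink the response but not the memory used.

```
# Sketch only: "size": 10000 is an assumed value, not part of my original query.
# filter_path keeps just the pipeline result in the response body.
GET /data/_search?filter_path=aggregations.PercentileOfNumberActions
{
    "size": 0,
    "aggs": {
        "ActionsByThing": {
            "terms": {
                "field": "thing_id",
                "size": 10000
            },
            "aggs": {
                "NumberActions": {
                    "sum": {
                        "field": "nActions"
                    }
                }
            }
        },
        "PercentileOfNumberActions": {
            "percentiles_bucket": {
                "buckets_path": "ActionsByThing>NumberActions",
                "percents": [ 1.0, 25.0, 50.0, 75.0, 99.0 ]
            }
        }
    }
}
```

So this covers the "don't return it" part, but not the "don't hold it all in memory" part.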

Any other approach?

Is it supported?

Hi,
please take a look at this page:

https://www.elastic.co/guide/en/logstash/current/plugins-filters-aggregate.html
I believe your example should be quite easy to cover with the above filter plugin.

Please note that you can put Ruby code inside it, so you are able to do a lot of things. Please also remember to use only one thread (one worker): with multiple workers, as you can guess, the filter behaves unpredictably.

I appreciate your reply, but this doesn't answer my question. It has nothing to do with data ingestion or Logstash; it's about ES aggregation queries.

I'm now assuming this is not supported yet and requires a new feature, so I opened https://github.com/elastic/elasticsearch/issues/21962
