Pipeline aggregation: full histogram of an aggregation


(Rodrigo Rezende) #1

Hi,

I want to estimate the distribution of an aggregated value, and I wonder whether there is a way to run the parent aggregation over all buckets (one per thing_id) and then compute percentiles over the per-bucket sums, so that I can estimate the full histogram.

For example, I could do the following:

GET /data/_search
{
    "size": 0,
    "aggs" : {
        "ActionsByThing" : {
            "terms" : {
                "field" : "thing_id"
            },
            "aggs": {
                "NumberActions": {
                    "sum": {
                        "field": "nActions"
                    }
                }
            }
        },
        "PercentileOfNumberActions": {
            "percentiles_bucket": {
                "buckets_path": "ActionsByThing>NumberActions", 
                "percents": [ 1.0, 2.5, 5.0, 10.0, 25.0, 50.0, 75.0, 90.0, 95.0, 97.5, 99.0] 
            }
        }
    }
}

But that covers only the top K thing_id buckets; I'm looking for the overall distribution, including the tail.

Even with a very large K, the result is still biased.
And even if the cardinality of thing_id is small enough that I can afford K > |thing_id|, the response will still include ActionsByThing, which I don't need; I only care about PercentileOfNumberActions.
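
To make the bias concrete, here is a minimal sketch in Python with made-up data. The `docs` dictionary and the `sums_per_thing` helper are hypothetical and only mimic a terms aggregation that keeps the top K buckets by document count; nothing here calls ES.

```python
from statistics import median

# Hypothetical data: thing_id -> nActions value of each document.
docs = {
    "a": [5, 5, 5, 5],
    "b": [4, 4, 4],
    "c": [3, 3],
    "d": [1],
    "e": [1],
    "f": [1],
}

def sums_per_thing(docs, k=None):
    """Sum nActions per thing_id, optionally keeping only the top k buckets.

    Buckets are ordered by document count, descending, which is an
    assumption about how the terms aggregation picks its top K.
    """
    buckets = sorted(docs.items(), key=lambda kv: len(kv[1]), reverse=True)
    if k is not None:
        buckets = buckets[:k]
    return [sum(values) for _, values in buckets]

top_k_median = median(sums_per_thing(docs, k=3))  # only buckets a, b, c
full_median = median(sums_per_thing(docs))        # all six buckets

print(top_k_median, full_median)
```

With this data the median over the top 3 buckets is much higher than the median over all buckets, because the tail of small thing_id buckets is dropped.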

Is there a way to tell ES, with some hint, to collect all ActionsByThing buckets only for the pipeline aggregation and never return them? Maybe that would also let it use less memory internally.

Any other approach?


(Rodrigo Rezende) #2

Is this supported?


(Krzysztof) #3

hi,
please take a look at those pages:


https://www.elastic.co/guide/en/logstash/current/plugins-filters-aggregate.html
I believe your example should be quite easy to cover with the above filter plugin.

Please note that you can put Ruby code inside it, so you are able to do a lot of things. Please also remember to use only one thread (one worker); with multiple threads, as you can guess, the behaviour is weird.


(Rodrigo Rezende) #4

I appreciate your reply, but this doesn't answer my question. This has nothing to do with data ingestion / logstash. It's about ES aggregation queries.


(Rodrigo Rezende) #5

I'm now assuming this is not supported yet and requires a new feature, so I opened https://github.com/elastic/elasticsearch/issues/21962
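
In the meantime, a partial workaround might look like the sketch below. It only builds the request and does not send it. The `build_request` helper and the `terms_size` value are hypothetical. It relies on the terms `size` parameter being set at least as high as the cardinality of thing_id, and on the `filter_path` response filter to drop ActionsByThing from the response. As far as I can tell this only trims what is returned; the buckets are still built on the server, so it does not address the memory concern.

```python
import json
from urllib.parse import urlencode

def build_request(index, terms_size):
    """Return (path, body) for a search that returns only the percentiles.

    terms_size is assumed to be >= the number of distinct thing_id values,
    so every bucket feeds the pipeline aggregation.
    """
    body = {
        "size": 0,
        "aggs": {
            "ActionsByThing": {
                "terms": {"field": "thing_id", "size": terms_size},
                "aggs": {
                    "NumberActions": {"sum": {"field": "nActions"}}
                },
            },
            "PercentileOfNumberActions": {
                "percentiles_bucket": {
                    "buckets_path": "ActionsByThing>NumberActions",
                    "percents": [1.0, 2.5, 5.0, 10.0, 25.0, 50.0,
                                 75.0, 90.0, 95.0, 97.5, 99.0],
                }
            },
        },
    }
    # filter_path keeps only the named part of the JSON response.
    query = urlencode(
        {"filter_path": "aggregations.PercentileOfNumberActions"}
    )
    return f"/{index}/_search?{query}", json.dumps(body, indent=2)

path, body = build_request("data", terms_size=10000)
print("GET", path)
print(body)
```

The printed path and body can be pasted into any HTTP client. The value 10000 is just a placeholder for "larger than the number of distinct thing_id values".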

