Pipeline aggregation: full histogram of an aggregation

Rodrigo_Rezende · November 15, 2016, 8:31pm

Hi,

I want to estimate the distribution of an aggregation, and I wonder whether there is any way to run the parent aggregation over all buckets (thing_id) and then compute percentiles over the sums, so that I can estimate the full histogram.

For example, I could do the following:

GET /data/_search
{
    "size": 0,
    "aggs" : {
        "ActionsByThing" : {
            "terms" : {
                "field" : "thing_id"
            },
            "aggs": {
                "NumberActions": {
                    "sum": {
                        "field": "nActions"
                    }
                }
            }
        },
        "PercentileOfNumberActions": {
            "percentiles_bucket": {
                "buckets_path": "ActionsByThing>NumberActions", 
                "percents": [ 1.0, 2.5, 5.0, 10.0, 25.0, 50.0, 75.0, 90.0, 95.0, 97.5, 99.0] 
            }
        }
    }
}

But that covers only the top K thing_id values; I'm looking for the overall distribution, including the tail.

Even if I make K very large, the result will still be biased.
And even if the cardinality of thing_id is small enough that I can afford K > |thing_id|, the aggregation response will still return ActionsByThing, which I really don't need; I only care about PercentileOfNumberActions.
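
To make the bias concrete, here is a minimal Python sketch. It is not Elasticsearch code: the data is made up, the helper names are hypothetical, and the nearest-rank percentile is only a stand-in for what percentiles_bucket does. It assumes the terms aggregation keeps the K buckets with the most documents, and compares the median of the per-thing sums over all things with the median over only those K.

```python
from collections import Counter, defaultdict


def nearest_rank(values, percent):
    """Nearest-rank percentile over exact values, without interpolation."""
    ordered = sorted(values)
    index = round(percent / 100.0 * (len(ordered) - 1))
    return ordered[index]


def sums_by_thing(docs):
    """Sum nActions per thing_id, like a terms agg with a sum sub-agg."""
    totals = defaultdict(int)
    for thing_id, n_actions in docs:
        totals[thing_id] += n_actions
    return dict(totals)


def top_k_by_doc_count(docs, k):
    """Keep only the k thing_ids that have the most documents."""
    counts = Counter(thing_id for thing_id, _ in docs)
    return {thing_id for thing_id, _ in counts.most_common(k)}


# Made-up data: thing i has i documents, each with nActions = 1,
# so the sum for thing i is i. There are 101 things in total.
docs = [(f"thing-{i}", 1) for i in range(1, 102) for _ in range(i)]

totals = sums_by_thing(docs)
kept = top_k_by_doc_count(docs, k=11)

median_all = nearest_rank(list(totals.values()), 50.0)
median_top_k = nearest_rank([totals[t] for t in kept], 50.0)
print("median over all things:", median_all)
print("median over top K only:", median_top_k)
```

In this toy data the top-K median lands far above the median over all things, because the tail of small thing_id buckets never reaches the percentile calculation.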

Is there a way to tell ES, with some hint, to collect all the ActionsByThing buckets (one per thing_id) just for the pipeline aggregation, but never return ActionsByThing in the response? Maybe that would let it optimize things internally and consume less memory.

Any other approach?

Rodrigo_Rezende · November 22, 2016, 8:02pm

is it supported?

krzysztof_pl · December 2, 2016, 6:24pm

hi,
please take a look at this page:

https://www.elastic.co/guide/en/logstash/current/plugins-filters-aggregate.html
I believe your example should be quite easy to cover with the above filter plugin.

Please note that you can put Ruby code inside it, so you are able to do a lot of things. Please also remember that you should use only one thread (one worker). With multithreading, as you can guess, the behaviour is weird.

Rodrigo_Rezende · December 4, 2016, 7:03pm

I appreciate your reply, but this doesn't answer my question. This has nothing to do with data ingestion / logstash. It's about ES aggregation queries.

Rodrigo_Rezende · December 4, 2016, 7:18pm

I'm now assuming this is not supported yet and requires a new feature, so I opened https://github.com/elastic/elasticsearch/issues/21962

system · January 1, 2017, 7:19pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
