Ordering terms aggregation based on Pipeline metric


(Nick Pentreath) #1

Hi,

I'm looking into the new pipeline aggregations for computing things like ratios per term bucket (e.g. click-through rate). I can compute the CTR per item using something like this:

q = {
    "query": {
        "match_all" : {}
    },
    "aggs" : {
        "ctr" : {
            "terms" : {
                "field" : "item_id",
                "size"  : 10,
            },
            "aggs": {
                "numer": {
                  "filter": {
                    "term": {
                      "event": "click"
                    }
                  }
                },
                "denom": {
                  "filter": {
                    "term": {
                      "event": "impression"
                    }
                  }
                },
                "ratio": {
                    "bucket_script": {
                        "buckets_path": {
                          "numer_total": "numer>_count",
                          "denom_total": "denom>_count"
                        },
                        "script": "numer_total / denom_total"
                    }
                }
            }
        }
    }
}

Now, say I wanted to rank results by CTR, I try this:

q = {
    "query": {
        "match_all" : {}
    },
    "aggs" : {
        "ctr" : {
            "terms" : {
                "field" : "item_id",
                "size"  : 10,
                "order": {
                    "ratio" : "desc" 
                }
            },
            "aggs": {
                "numer": {
                  "filter": {
                    "term": {
                      "event": "click"
                    }
                  }
                },
                "denom": {
                  "filter": {
                    "term": {
                      "event": "impression"
                    }
                  }
                },
                "ratio": {
                    "bucket_script": {
                        "buckets_path": {
                          "numer_total": "numer>_count",
                          "denom_total": "denom>_count"
                        },
                        "script": "numer_total / denom_total"
                    }
                }
            }
        }
    }
}

But it doesn't seem that the ratio aggregation can be used for sorting, even though it is a single-valued metric?

I get this error:

RemoteTransportException[[Madame Menace][127.0.0.1:9300][indices:data/read/search[phase/query]]]; nested: AggregationExecutionException[Invalid term-aggregator order path [ratio]. Unknown aggregation [ratio]];
Caused by: AggregationExecutionException[Invalid term-aggregator order path [ratio]. Unknown aggregation [ratio]]
	at org.elasticsearch.search.aggregations.support.AggregationPath.validate(AggregationPath.java:293)
	at org.elasticsearch.search.aggregations.bucket.terms.InternalOrder.validate(InternalOrder.java:145)
	at org.elasticsearch.search.aggregations.bucket.terms.InternalOrder.validate(InternalOrder.java:138)
	at org.elasticsearch.search.aggregations.bucket.terms.TermsAggregator.<init>(TermsAggregator.java:143)
...

Is this not supported? Or is there another way to achieve the same result? I previously tried to use the result of a Scripted Metric Agg to sort, but that is not supported either.


(Colin Goodheart-Smithe) #2

In order to be able to use an aggregation for sorting the terms aggregation it must currently be a numeric metric aggregation. Pipeline Aggregations are a different family of aggregations than the Metric Aggregations and cannot be used for sorting. You can't sort using a Pipeline aggregation because they are only executed in the reduce phase on the coordinating node and so you do not have the information on the shard in order to be able to sort the shards buckets to pick the top N to send to the coordinating node.

The Scripted Metric Aggregation is part of the Metric Aggregation family but is not a Numeric Metric Aggregation since it returns an arbitrary object. However, there is an issue open in the Elasticsearch repo to add the ability to sort by an attribute of the Scripted Metric Aggregation: https://github.com/elastic/elasticsearch/issues/8486


(Nick Pentreath) #3

Thanks - ok that makes sense if the pipelines are done only on the
coordinating node.

I did come across that issue but it seemed like it wasn't progressing. What
are the major blockers remaining there (I see
https://github.com/elastic/elasticsearch/pull/8421 is done but
https://github.com/elastic/elasticsearch/issues/8434 is outstanding /
closed)?

It would seem on the surface that using getProperty to retrieve some value
of the Scripted Metric Agg to sort on should be do-able?


(Colin Goodheart-Smithe) #4

Unfortunately it's not just a case of implementing the getProperty method to retrieve values. Currently the comparator that sorts the terms buckets does not use the getProperty method to sort, which is why it is limited to only Numeric Metric Aggregations since these have the value() or value(String) methods it currently uses. Moving to use the getProperty method is not a trivial change as both the comparator and probably some of the AggregationPath parsing code will need changing to allow for more complex paths within an aggregation.

There aren't any blocker issues stopping this from progressing, it just needs someone to have the time to implement it.


(Nick Pentreath) #5

Ok. Out of interest - I have a decent understanding of USING aggregations
but haven't delved into the code within ES much. Where would be a good
place to start understanding the aggregation path parsing and sorting code,
to better understand the work involved for this change?


(Laurent Quérel) #6

The inability to sort the result by a scripted metric is major blocker to transform ES to usable analytics engine. IMO it's a big limitation to not support something like 'select sum(x)/sum(y) as ratio from table group by term order by ratio'.
Is there an ETA for the support of this feature?


(Laurent Quérel) #7

Here an attempt to fix this issue https://github.com/elastic/elasticsearch/pull/15718 specifically for scripted metrics returning a number.

Because it's a vacation period, we probably need to wait a little before bit to get some feedback and may be an approval.


(Gchen) #8

+1

It will be very helpful if ES can support this feature!


(Buonzz) #9

+1
I terribly need this feature.


(Ben) #10

The inability to sort the result by a scripted metric is major blocker to transform ES to usable analytics engine.

Couldn't agree more. I have migrated a whole app from SolR to ES, thinking that sorting and paginating aggregations would work out of the box.

These 2 missing features require hacking the query to fetch 9999 results per aggregation, sort them manually, and slice the desired offset / limit. Add the pain of sorting on different types (aggregations can be made of dates, integers, strings, ...)

Please, implement this and ES will be the more powerful BI engine ever !


(dullian) #11

+1
considering you have to get top n ratio , the results may be wrong if the size in your terms is too small . but a large size number like 10,000 would hurt performance .


(system) #12