Pipeline Aggregations with buckets_path more than two


(Ruslan Didyk) #1

Hi all, I'm getting an error when I'm trying to use Pipeline Aggregation with bucket_path more than one (I.e.: "buckets_path" : "aggOne>aggTwo>aggThree"). For example I have query with 'terms' for two fields and some aggregations for those groups, and I want to get aggregation result from those aggregations.
Here is my simple mappings: (I use Elasticsearch 5.1.1)

Author (name=text, gender=text, age=integer)
Book (title=text, pages=integer, _parent=book)

Query below with 'terms' for only one field and aggregation works fine:

GET author/author/_search

{
    "size": 0,
    "aggs" : {
        "gender" : {
            "terms" : {
                "field" : "gender.keyword"
            },
            "aggs" : {
                "avg" : {
                    "avg" : {
                        "field" : "age"
                    }
                }
            }
        },
        "max" : {
            "max_bucket" : {
                "buckets_path" : "gender>avg"
            }
        }
    }
}

But for query with 'terms' for two fields and aggregation shows an error:

GET author/author/_search

{
   "size":0,
   "aggs":{
      "gender":{
         "terms":{
            "field":"gender.keyword"
         },
         "aggs":{
            "age":{
               "terms":{
                  "field":"age"
               },
               "aggs":{
                  "avg":{
                     "avg":{
                        "field":"age"
                     }
                  }
               }
            }
         }
      },
      "max":{
         "max_bucket":{
            "buckets_path":"gender>age>avg"
         }
      }
   }
}

Error message:

{
   "error": {
      "root_cause": [],
      "type": "reduce_search_phase_exception",
      "reason": "[reduce] ",
      "phase": "fetch",
      "grouped": true,
      "failed_shards": [],
      "caused_by": {
         "type": "aggregation_execution_exception",
         "reason": "buckets_path must reference either a number value or a single value numeric metric aggregation, got: java.lang.Object[]"
      }
   },
   "status": 503
}

Do you have any ideas how to solve this issue? Or Pipeline Aggregations do not support this case? Thank's a lot for any suggestions.

P.S: Actually I've found some workaround how to solve this, but it doesn't look very nice.
Here it is:

{  
   "size":0,
   "aggs":{  
      "gender":{  
         "terms":{  
            "field":"gender.keyword"
         },
         "aggs":{  
            "age":{  
               "terms":{  
                  "field":"age"
               },
               "aggs":{  
                 "avg":{  
                     "avg":{  
                        "field":"age"
                     }
                  }
               }
            },
            "max_avg":{  
               "max_bucket":{  
                  "buckets_path":"age>avg"
               }
            }
         }
      },
      "max":{  
         "max_bucket":{  
            "buckets_path":"gender>max_avg"
         }
      }
   }
}

I've added one more Pipeline Aggregation to aggregate max average for each 'age' groups, and then my main Pipeline Aggregation aggregates max from aggregated values.
Response look's like this:

"aggregations": {
      "gender": {
         "buckets": [
            {
               "key": "male",
               "doc_count": 4,
               "age": {
                  "buckets": [
                     {
                        "key": 80,
                        "doc_count": 2,
                        "avg": {
                           "value": 80
                        }
                     },
                     {
                        "key": 22,
                        "doc_count": 1,
                        "avg": {
                           "value": 22
                        }
                     },
                     {
                        "key": 40,
                        "doc_count": 1,
                        "avg": {
                           "value": 40
                        }
                     }
                  ]
               },
               "max": {
                  "value": 80,
                  "keys": [
                     "80"
                  ]
               }
            },
            {
               "key": "female",
               "doc_count": 2,
               "age": {
                  "buckets": [
                     {
                        "key": 40,
                        "doc_count": 2,
                        "avg": {
                           "value": 40
                        }
                     }
                  ]
               },
               "max": {
                  "value": 40,
                  "keys": [
                     "40"
                  ]
               }
            }
         ]
      },
      "max": {
         "value": 80,
         "keys": [
            "male"
         ]
      }
   }

(Mark Harwood) #2

Is this just a theoretical exercise? I can't see what business problem you're trying to solve here - the same choice of field being used in terms agg and avg aggs doesn't make sense.


(Ruslan Didyk) #3

Mappings and queries that I provided are just an example. Sorry, maybe this wasn't the best case:)
Buts it's not theoretical exercise.

What I'm trying to do is to implement service with OData (http://www.odata.org/) specification, and it supports such kind of queries.
So there could be case when someone wants to group by multiple fields, aggregate some other fields values and then get final aggregated value from those groups.

Query example from OData:

GET ~/Sales?$apply=groupby((Time),aggregate(Amount with sum as Total))/aggregate(Total with average as DailyAverage)

So I wonder, whether Pipeline Aggregation work's fine or this case is discussable?


(Mark Harwood) #4

So if I'm reading the example right it is trying to do some kind of standard deviation where each day's total is expressed as a multiple of the average sales for all days.

To do this you'd need something like this:

DELETE test
POST test/sales
{
	"amount":10,
	"date":"2012-09-21"
}
POST test/sales
{
	"amount":20,
	"date":"2012-09-21"
}
POST test/sales
{
	"amount":5,
	"date":"2012-09-22"
}
POST test/sales
{
	"amount":15,
	"date":"2012-09-22"
}

GET test/sales/_search
{
   "size": 0,
   "aggs": {
	  "by_date": {
		 "date_histogram": {
			"field": "date",
			"interval": "day"
		 },
		 "aggs": {
			"Total": {
			   "sum": {
				  "field": "amount"
			   }
			},
			"normalized_sales": {
			   "bucket_script": {
				  "buckets_path": {
					 "dailySalesTotal": "Total",
					 //!!!! THIS DOES NOT WORK.... !!!!!
					 "allDaysAverage": "../global_stats>avg"
				  },
				  "script": "params.dailySalesTotal / params.allDaysAverage"
			   }
			}
		 }
	  },
	  "global_stats": {
		 "stats_bucket": {
			"buckets_path": "by_date>Total"
		 }
	  }
   }
}

But notice this won't work because inside the array of by_date results we can't refer to a parent-level stat (attempted here using the illegal .. syntax). The docs state:

Paths are relative from the position of the pipeline aggregation; they are not absolute paths, and the path cannot go back "up" the aggregation tree


(system) #5

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.