Calculating aggs at each level of subaggregation


(Sri) #1

Hi,

I have event data like this. And I need to calculate aggs at multiple levels of bucket aggregations. I cannot find a way to calculate aggs except at the "leaf level".

{ "id": 1, "field1" : "A1", "field2" : "B11", "field3" : 100, "field4" : 2.0}
{ "id": 2, "field1" : "A1", "field2" : "B21", "field3" : 95, "field4" : 1.8}
{ "id": 3, "field1" : "A1", "field2" : "B11", "field3" : 124, "field4" : 1.6}

{ "id": 4, "field1" : "A2", "field2" : "B12", "field3" : 104, "field4" : 2.1}
{ "id": 5, "field1" : "A2", "field2" : "B22", "field3" : 101, "field4" : 1.7}
{ "id": 6, "field1" : "A2", "field2" : "B22", "field3" : 103, "field4" : 1.4}

In my data set,

  • field3 and field4 can be any numeric value and the values vary with each event.

  • field1 and field2 are used as terms in query.

  • A1 has multiple Bx1 terms. Similarly, A2 has multiple Bx2 terms.

  • I'm interested in averaging fields3 and fields4 by buckets of Bxy terms. I would get 1 avg for each Bxy term (level2 avg).

e.g id 5, 6 would become one level2 avg (based on A2, B22 term combination). Similarly, id 1, 3 would become one level2 avg (based on A1, B11 term combination)

NOTE: I am able to use nested terms aggregations and calculate this level2 avg.

  • Then I need a rollup avg for A1 (level1), that is an avg of the avg calculated at level2.

I'm unable to figure out how to do this rollup avg. I tried using bucket_path, but get "Found two sub aggregation definitions" error as I have a terms and avg aggregation at level1.

  • If I were successful, I would calculate another rollup avg for all Ax (level0), that is an avg of all level1 avg computed.

How can I do this in ES?

Thanks.

The working level2 avg query is below. The attempt at using bucket_path is further below.

GET _search
{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "by_time": {
      "date_histogram": {
        "field": "@timestamp",
        "interval": "day"
      },
      "aggs": {
        "field1_aggs": {
          "terms": {
            "field": "field1.raw",
            "size": 0
          },
          "aggs": {
            "field2_aggs": {
              "terms": {
                "field": "field2.raw",
                "size": 0
              },
              "aggs": {
                "field3_avg": {
                  "avg": {
                    "field": "field3"
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

Attempt with bucket_path

GET _search
{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "by_time": {
      "date_histogram": {
        "field": "@timestamp",
        "interval": "day"
      },
      "aggs": {
        "field1_aggs": {
          "terms": {
            "field": "field1.raw",
            "size": 0
          },
          "aggs": {
            "field2_aggs": {
              "terms": {
                "field": "field2.raw",
                "size": 0
              },
              "aggs": {
                "field3_avg": {
                  "avg": {
                    "field": "field3"
                  }
                }
              }
            }
          },
          "aggs": {
            "field1_avg": {
              "avg": {
                "buckets_path": { "field2_aggs>field3_avg.avg" }
              }
            }
          }
        }
      }
    }
  }
}

(Sri) #2

Hi,

I've still not succeeded to find a way - does ES even allow such multi-level computations in a single query?

Or, do I have to index the result from the first query that calculates level2 result and then issue a second query to calculate the level 1 result?

Thx


(Sri) #3

Replying to my own messages :slight_smile:...

It seems I need to use pipeline aggregations vs. "normal" aggregations. The official documentation called pipeline aggregations experimental, so I had not paid attention earlier, but they are very relevant for my use-case as described in this blog https://www.elastic.co/blog/out-of-this-world-aggregations

However, I still cannot get the syntax right; regardless of where I add the the avg pipeline aggregation to the query, I get errors with this

          "field1_avg": {
             "avg_bucket": {
                "buckets_path": "field2_aggs>field3_avg" 
            }

Or by adding ".value" to the buckets_path

"buckets_path": "field2_aggs>field3_avg.value"

Error:

buckets_path must reference either a number value or a single value numeric metric aggregation, got: java.lang.Object[]


(Sri) #4

This is solved now. For benefit of anyone else that looks at this thread : The problem was that the pipeline aggregation needed to be added at the sibling level of the field2_aggs (and I was originally adding it one level up in the query).

I could then also add another pipeline agg one level up, at the field1_aggs level, that used the output from the first pipeline agg to compute the top-most level agg.


(Swelltt0576) #5

dear sir,
I have the same problem now and have some confusing with you answer.
can you give me any code?

thanks very much


(system) #6