Sum of duration field of max per group in Elasticsearch

I would like to create a visualizer by summing up duration field after retrieving max id per group in Elasticsearch. For example:

Data is:

id workflow sid duration
1 A x1 1m
1 A x2 2m
2 A x1 2m
2 A x2 3m
1 B y1 1m
1 B y2 2m
2 B y1 2m
2 B y2 3m
3 B y1 4m
3 B y2 2m

Given the table below, expected returned data as follows, which is max of id per workflow and sum up the duration.

id workflow total
2 A 5m
3 B 6m

I'm new to Elasticsearch query and Kibana. Appreciate it if you can provide a pointer how to resolve my problem statement.

{
  "size": 0,
  "aggs": {
    "my-bucket": {
      "terms": {
        "field": "workflow"
      },
      "aggs": {
        "max_id": {
          "max": {
            "field": "id"
          }
        }
      }
    }
  }
}

I have the search query above with expected bucket of workflow and max id #. How to use the max id # to retrieve the sid and sum up the duration.

Hi @carollyl

What version are you on?

Apologies but looking at your example output I do not see the logic / math how you come up with results ... can you show the logic for each row in the result?

Is there supposed to be a row for id 1?

I'm using v 7.17.0.

My search query can only return the id and workflow. I'm looking for advice how to sum up the duration given the id and workflow.

  "aggregations" : {
    "my-bucket" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "A",
          "doc_count" : 2,
          "max_id" : {
            "value" : 2
          }
        },
        {
          "key" : "B",
          "doc_count" : 2,
          "max_id" : {
            "value" : 3
          }
        }
      ]
    }
  }

Apologies but before we try to construct the query we need to under the logic / aggregation / equation you want ... How did you arrive at the please show equation pseudo code etc...I can not figure out what you want ..

Also should there be a row for id: 1
:

id workflow total logic
2 A 5m Equation for this
3 B 6m Equation for this

Sorry for the confusion. The table above was just an illustration that I'd like to achieve. My current equation can only return the max id and bucketed workflow. See below. Instead of id, it is actually the max_id based on my query.

max_id workflow
2 A
3 B

And, I'd like to get the sum of duration for each of the bucketed workflow. For example, workflow A has sum of duration from x1 and x2 (yield 5m), whereas workflow B has sum of duration from y1 and y2 (yield 6m). This is exactly the problem that I couldn't resolve.

My query:

{
  "size": 0,
  "aggs": {
    "my-bucket": {
      "terms": {
        "field": "workflow"
      },
      "aggs": {
        "max_id": {
          "max": {
            "field": "id"
          }
        }
      }
    }
  }
}

Result from above query:

  "aggregations" : {
    "my-bucket" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "A",
          "doc_count" : 2,
          "max_id" : {
            "value" : 2
          }
        },
        {
          "key" : "B",
          "doc_count" : 2,
          "max_id" : {
            "value" : 3
          }
        }
      ]
    }
  }

The question is, how to retrieve the sum of the duration per bucketed workflow per max id?

Exactly what do you mean by bucketed workflow

What is your top-level aggregation?

  1. id or
  2. workflow
  3. id and then sub-aggregation workflow

1,2, or 3?

I ask because

The sentence above says A) below but I think you might mean B)

A) The sum of duration on Workflow A = 1m (Row 1) + 2m (Row 2) + 1m (Row 3) + 3m (Row 4) = 7m

B) The sume of duration Workflow = Max duration of x1 = 2m (from row 3) + Max duration of x2 = 3m (from row 5) which = sum of 2 + 3 = 5

Which do you want A) or B) ?

Is _id part of the aggregation or not?

If you could fill in your results table with the equation like this above perhaps I can help.

Hi @carollyl

if B) is what you want
The trick is using the max then sum_bucket

This query for each workflow finds the max of each sid then adds them together.

This is based on the data in your original table.

GET discuss-duration/_search
{
  "size": 0,
  "aggs": {
    "workflow": {
      "terms": {
        "field": "workflow"
      },
      "aggs": {
        "sid": {
          "terms": {
            "field": "sid"
          },
          "aggs": {
            "max_sid": {
              "max": {
                "field": "duration"
              }
            }
          }
        },
        "sum_of_max": {
          "sum_bucket": {
            "buckets_path": "sid>max_sid"
          }
        }
      }
    }
  }
}

Results

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 10,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "workflow": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "B",
          "doc_count": 6,
          "sid": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "y1",
                "doc_count": 3,
                "max_sid": {
                  "value": 4
                }
              },
              {
                "key": "y2",
                "doc_count": 3,
                "max_sid": {
                  "value": 3
                }
              }
            ]
          },
          "sum_of_max": {
            "value": 7
          }
        },
        {
          "key": "A",
          "doc_count": 4,
          "sid": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "x1",
                "doc_count": 2,
                "max_sid": {
                  "value": 2
                }
              },
              {
                "key": "x2",
                "doc_count": 2,
                "max_sid": {
                  "value": 3
                }
              }
            ]
          },
          "sum_of_max": {
            "value": 5
          }
        }
      ]
    }
  }
}

Thanks for explaining your expectation on equation. This is what I'd like to achieve.

  1. Identify the max id per workflow
  2. Sum up the duration per workflow based on max id

For example,
Max id of workflow A is id 2. Given id 2, x1 (row3) = 2m, x2 (row 4) = 3m, sum of duration is 2+3=5.
Max id of workflow B is id 3. Given id 3, y1 (row9) = 4m, y2 (row 10) = 2m, sum of duration is 4+2=6.

| max_id | workflow | sum_duration | equation                                                  |
|--------|----------|--------------|-----------------------------------------------------------|
| 2      | A        | 5m           | Max(id) of workflow A is 2. Sum of duration is 5m (x1+x2) |
| 3      | B        | 6m           | Max(id) of workflow B is 3. Sum of duration is 6m (y1+y2) |

I'd like to see the aggregated results as follow:

  • Workflow A = 5m
  • Workflow B = 6m

Ok thanks the equations are much clearer BUT my first thought is that is much harder and may require some advanced scripting (which I am not an expert at), I will try to take a look tomorrow.

It would seem that you want to Use the max id per workflow as a filter / selector I do not know how to do that off the top of my head.... you can not have sub aggregations of max which seems like what you will need to do..

It's late I will need to take a look tomorrow perhaps someone else may look in the meantime.

Hi @carollyl

I have not got it solved yet but here is where I am

Input Data (I changed this because otherwise I could not test false / incorrect solutions)

id,workflow,sid,duration
1,A,x1,1
1,A,x2,6
2,A,x1,2
2,A,x2,3
1,B,y1,1
1,B,y2,2
2,B,y1,8
2,B,y2,3
3,B,y1,4
3,B,y2,2

So now with This Query

GET discuss-duration/_search
{
  "size": 0,
  "aggs": {
    "workflow": {
      "terms": {
        "field": "workflow"
      },
      "aggs": {
        "ids": {
          "terms": {
            "field": "id"
          },
          "aggs": {
            "sum_duration": {
              "sum": {
                "field": "duration"
              }
            }
          }
        }
      }
    }
  }
}

I get these results which includes the correct results but does not isolate them yets

# Results
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 10,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "workflow": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "B",
          "doc_count": 6,
          "ids": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": 1,
                "doc_count": 2,
                "sum_duration": {
                  "value": 3
                }
              },
              {
                "key": 2,
                "doc_count": 2,
                "sum_duration": {
                  "value": 11 <!--- Changed the Data because if I do max_duration this would give the wrong answer
                }
              },
              {
                "key": 3, <!- Max id for workflow B
                "doc_count": 2,
                "sum_duration": {
                  "value": 6 <!--- Correct 
                }
              }
            ]
          }
        },
        {
          "key": "A",
          "doc_count": 4,
          "ids": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": 1,
                "doc_count": 2,
                "sum_duration": {
                  "value": 7
                }
              },
              {
                "key": 2, <!- Max id for workflow A
                "doc_count": 2,
                "sum_duration": {
                  "value": 5 <!--- Correct 
                }
              }
            ]
          }
        }
      ]
    }
  }
}

Now I will work on limiting the result ... I am not sure how I am looking at some of the pipeline aggregations...

@carollyl I think I have it now!!!

So this query does a
Terms on workflow
-- Terms on id and sort and max id and only return 1 row (which will be the max)
-- Sums on duration

GET discuss-duration/_search
{
  "size": 0,
  "aggs": {
    "workflow": {
      "terms": {
        "field": "workflow"
      },
      "aggs": {
        "ids": {
          "terms": {
            "field": "id",
            "order": { "max_id": "desc" },
            "size": 1
          },
          "aggs": {
            "max_id": {
              "max": {
                "field": "id"
              }
            },
            "sum_duration": {
              "sum": {
                "field": "duration"
              }
            }
          }
        }
      }
    }
  }
}
# Results
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 10,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "workflow": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "B",
          "doc_count": 6,
          "ids": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 4,
            "buckets": [
              {
                "key": 3,
                "doc_count": 2,
                "sum_duration": {
                  "value": 6
                },
                "max_id": {
                  "value": 3
                }
              }
            ]
          }
        },
        {
          "key": "A",
          "doc_count": 4,
          "ids": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 2,
            "buckets": [
              {
                "key": 2,
                "doc_count": 2,
                "sum_duration": {
                  "value": 5
                },
                "max_id": {
                  "value": 2
                }
              }
            ]
          }
        }
      ]
    }
  }
}
1 Like

And now the Table (this is 8.9) but I think you can do it in 7.17

Rows
Top Value of workflow
Max value of id Trickey is really Top 1 Values of id Ranked by Custom Max Value of id (This custom rank function might not be in 7.17)
Then Metric is just sum of duration

@stephenb This is cool, thank you so much! You save my many days! Next, Iā€™d like to create visualizers using the returned data.

I don't know what you mean by a visualizer... can you provide more details.

I showed a table ... what are you looking for?

Hi @stephenb , I'd like to build 2 dimension visualizer like follows. You helped me with the aggregated results (duration with workflow*). Is it possible to build the visualizer with breakdown by y-axis (which is release*).

I'd like to clarify, is it necessary to ingest the aggregated results into ES again in order to build visualizer? I'm confused with the records of the screenshot below.

Hi @carollyl

Those are just the field names for my sample data, had to make my own sample data based on what you provided.

If those are in separate indices that will not work .... if you can put them in the same index perhaps

So here is my test data set.
The first row is the field names
workflow, release,sid are all keywords
id, duration are long

release,id,workflow,sid,duration
1.0,1,A,x1,1
1.0,1,A,x2,6
1.0,2,A,x1,2
1.0,2,A,x2,3
1.0,1,B,y1,1
1.0,1,B,y2,2
1.0,2,B,y1,8
1.0,2,B,y2,3
1.0,3,B,y1,4
1.0,3,B,y2,2
1.1,1,A,x1,1
1.1,1,A,x2,6
1.1,2,A,x1,2
1.1,2,A,x2,4
1.2,1,C,y1,3
1.2,1,C,y2,2
1.2,2,C,y1,8
1.2,2,C,y2,3
1.2,4,C,y1,9
1.2,4,C,y2,7

You can create a table by just adding the Release Row as the first row in the Rows fields

Unfortunatly the Heat Map does not work because you can not apply the max id selection (I assume that requirement still exists) because that level of filtering is not available on that Visuzilaton.

If you want to use the heat map you would need to pre-filter / aggregate the data.

Is it possible to use Vega to address my use case? And yes, the max id criteria still hold.

Yes probably, but I can not help you with Vega

You might look at using a transform.

Think pivot table

1 Like