Sum of duration field of max per group in Elasticsearch

carollyl · September 4, 2023, 8:36pm

I would like to create a visualizer by summing up duration field after retrieving max id per group in Elasticsearch. For example:

Data is:

id workflow sid duration

1 A x1 1m

1 A x2 2m

2 A x1 2m

2 A x2 3m

1 B y1 1m

1 B y2 2m

2 B y1 2m

2 B y2 3m

3 B y1 4m

3 B y2 2m

Given the table below, expected returned data as follows, which is max of id per workflow and sum up the duration.

id workflow total

2 A 5m

3 B 6m

I'm new to Elasticsearch query and Kibana. Appreciate it if you can provide a pointer how to resolve my problem statement.

carollyl · September 4, 2023, 11:20pm

{
  "size": 0,
  "aggs": {
    "my-bucket": {
      "terms": {
        "field": "workflow"
      },
      "aggs": {
        "max_id": {
          "max": {
            "field": "id"
          }
        }
      }
    }
  }
}

I have the search query above with expected bucket of workflow and max id #. How to use the max id # to retrieve the sid and sum up the duration.

stephenb · September 5, 2023, 12:49am

Hi @carollyl

What version are you on?

Apologies but looking at your example output I do not see the logic / math how you come up with results ... can you show the logic for each row in the result?

Is there supposed to be a row for id 1?

carollyl · September 5, 2023, 2:01am

I'm using v 7.17.0.

My search query can only return the id and workflow. I'm looking for advice how to sum up the duration given the id and workflow.

  "aggregations" : {
    "my-bucket" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "A",
          "doc_count" : 2,
          "max_id" : {
            "value" : 2
          }
        },
        {
          "key" : "B",
          "doc_count" : 2,
          "max_id" : {
            "value" : 3
          }
        }
      ]
    }
  }

stephenb · September 5, 2023, 2:28am

Apologies but before we try to construct the query we need to under the logic / aggregation / equation you want ... How did you arrive at the please show equation pseudo code etc...I can not figure out what you want ..

Also should there be a row for id: 1
:

id workflow total logic

2 A 5m Equation for this

3 B 6m Equation for this

carollyl · September 5, 2023, 2:43am

Sorry for the confusion. The table above was just an illustration that I'd like to achieve. My current equation can only return the max id and bucketed workflow. See below. Instead of id, it is actually the max_id based on my query.

max_id	workflow
2	A
3	B

And, I'd like to get the sum of duration for each of the bucketed workflow. For example, workflow A has sum of duration from x1 and x2 (yield 5m), whereas workflow B has sum of duration from y1 and y2 (yield 6m). This is exactly the problem that I couldn't resolve.

carollyl · September 5, 2023, 2:45am

My query:

{
  "size": 0,
  "aggs": {
    "my-bucket": {
      "terms": {
        "field": "workflow"
      },
      "aggs": {
        "max_id": {
          "max": {
            "field": "id"
          }
        }
      }
    }
  }
}

Result from above query:

  "aggregations" : {
    "my-bucket" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "A",
          "doc_count" : 2,
          "max_id" : {
            "value" : 2
          }
        },
        {
          "key" : "B",
          "doc_count" : 2,
          "max_id" : {
            "value" : 3
          }
        }
      ]
    }
  }

The question is, how to retrieve the sum of the duration per bucketed workflow per max id?

stephenb · September 5, 2023, 2:59am

Exactly what do you mean by bucketed workflow

What is your top-level aggregation?

id or
workflow
id and then sub-aggregation workflow

1,2, or 3?

I ask because

The sentence above says A) below but I think you might mean B)

A) The sum of duration on Workflow A = 1m (Row 1) + 2m (Row 2) + 1m (Row 3) + 3m (Row 4) = 7m

B) The sume of duration Workflow = Max duration of x1 = 2m (from row 3) + Max duration of x2 = 3m (from row 5) which = sum of 2 + 3 = 5

Which do you want A) or B) ?

Is _id part of the aggregation or not?

If you could fill in your results table with the equation like this above perhaps I can help.

stephenb · September 5, 2023, 3:15am

Hi @carollyl

if B) is what you want
The trick is using the max then sum_bucket

This query for each workflow finds the max of each sid then adds them together.

This is based on the data in your original table.

GET discuss-duration/_search
{
  "size": 0,
  "aggs": {
    "workflow": {
      "terms": {
        "field": "workflow"
      },
      "aggs": {
        "sid": {
          "terms": {
            "field": "sid"
          },
          "aggs": {
            "max_sid": {
              "max": {
                "field": "duration"
              }
            }
          }
        },
        "sum_of_max": {
          "sum_bucket": {
            "buckets_path": "sid>max_sid"
          }
        }
      }
    }
  }
}

Results

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 10,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "workflow": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "B",
          "doc_count": 6,
          "sid": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "y1",
                "doc_count": 3,
                "max_sid": {
                  "value": 4
                }
              },
              {
                "key": "y2",
                "doc_count": 3,
                "max_sid": {
                  "value": 3
                }
              }
            ]
          },
          "sum_of_max": {
            "value": 7
          }
        },
        {
          "key": "A",
          "doc_count": 4,
          "sid": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "x1",
                "doc_count": 2,
                "max_sid": {
                  "value": 2
                }
              },
              {
                "key": "x2",
                "doc_count": 2,
                "max_sid": {
                  "value": 3
                }
              }
            ]
          },
          "sum_of_max": {
            "value": 5
          }
        }
      ]
    }
  }
}

carollyl · September 5, 2023, 4:18am

Thanks for explaining your expectation on equation. This is what I'd like to achieve.

Identify the max id per workflow
Sum up the duration per workflow based on max id

For example,
Max id of workflow A is id 2. Given id 2, x1 (row3) = 2m, x2 (row 4) = 3m, sum of duration is 2+3=5.
Max id of workflow B is id 3. Given id 3, y1 (row9) = 4m, y2 (row 10) = 2m, sum of duration is 4+2=6.

| max_id | workflow | sum_duration | equation                                                  |
|--------|----------|--------------|-----------------------------------------------------------|
| 2      | A        | 5m           | Max(id) of workflow A is 2. Sum of duration is 5m (x1+x2) |
| 3      | B        | 6m           | Max(id) of workflow B is 3. Sum of duration is 6m (y1+y2) |

I'd like to see the aggregated results as follow:

Workflow A = 5m
Workflow B = 6m

stephenb · September 5, 2023, 5:26am

Ok thanks the equations are much clearer BUT my first thought is that is much harder and may require some advanced scripting (which I am not an expert at), I will try to take a look tomorrow.

It would seem that you want to Use the max id per workflow as a filter / selector I do not know how to do that off the top of my head.... you can not have sub aggregations of max which seems like what you will need to do..

It's late I will need to take a look tomorrow perhaps someone else may look in the meantime.

stephenb · September 6, 2023, 4:44am

Hi @carollyl

I have not got it solved yet but here is where I am

Input Data (I changed this because otherwise I could not test false / incorrect solutions)

id,workflow,sid,duration
1,A,x1,1
1,A,x2,6
2,A,x1,2
2,A,x2,3
1,B,y1,1
1,B,y2,2
2,B,y1,8
2,B,y2,3
3,B,y1,4
3,B,y2,2

So now with This Query

GET discuss-duration/_search
{
  "size": 0,
  "aggs": {
    "workflow": {
      "terms": {
        "field": "workflow"
      },
      "aggs": {
        "ids": {
          "terms": {
            "field": "id"
          },
          "aggs": {
            "sum_duration": {
              "sum": {
                "field": "duration"
              }
            }
          }
        }
      }
    }
  }
}

I get these results which includes the correct results but does not isolate them yets

# Results
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 10,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "workflow": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "B",
          "doc_count": 6,
          "ids": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": 1,
                "doc_count": 2,
                "sum_duration": {
                  "value": 3
                }
              },
              {
                "key": 2,
                "doc_count": 2,
                "sum_duration": {
                  "value": 11 <!--- Changed the Data because if I do max_duration this would give the wrong answer
                }
              },
              {
                "key": 3, <!- Max id for workflow B
                "doc_count": 2,
                "sum_duration": {
                  "value": 6 <!--- Correct 
                }
              }
            ]
          }
        },
        {
          "key": "A",
          "doc_count": 4,
          "ids": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": 1,
                "doc_count": 2,
                "sum_duration": {
                  "value": 7
                }
              },
              {
                "key": 2, <!- Max id for workflow A
                "doc_count": 2,
                "sum_duration": {
                  "value": 5 <!--- Correct 
                }
              }
            ]
          }
        }
      ]
    }
  }
}

Now I will work on limiting the result ... I am not sure how I am looking at some of the pipeline aggregations...

stephenb · September 6, 2023, 4:51am

@carollyl I think I have it now!!!

So this query does a
Terms on workflow
-- Terms on id and sort and max id and only return 1 row (which will be the max)
-- Sums on duration

GET discuss-duration/_search
{
  "size": 0,
  "aggs": {
    "workflow": {
      "terms": {
        "field": "workflow"
      },
      "aggs": {
        "ids": {
          "terms": {
            "field": "id",
            "order": { "max_id": "desc" },
            "size": 1
          },
          "aggs": {
            "max_id": {
              "max": {
                "field": "id"
              }
            },
            "sum_duration": {
              "sum": {
                "field": "duration"
              }
            }
          }
        }
      }
    }
  }
}

# Results
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 10,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "workflow": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "B",
          "doc_count": 6,
          "ids": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 4,
            "buckets": [
              {
                "key": 3,
                "doc_count": 2,
                "sum_duration": {
                  "value": 6
                },
                "max_id": {
                  "value": 3
                }
              }
            ]
          }
        },
        {
          "key": "A",
          "doc_count": 4,
          "ids": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 2,
            "buckets": [
              {
                "key": 2,
                "doc_count": 2,
                "sum_duration": {
                  "value": 5
                },
                "max_id": {
                  "value": 2
                }
              }
            ]
          }
        }
      ]
    }
  }
}

stephenb · September 6, 2023, 5:05am

And now the Table (this is 8.9) but I think you can do it in 7.17

Rows
Top Value of workflow
Max value of id Trickey is really Top 1 Values of id Ranked by Custom Max Value of id (This custom rank function might not be in 7.17)
Then Metric is just sum of duration

carollyl · September 8, 2023, 5:06am

@stephenb This is cool, thank you so much! You save my many days! Next, I’d like to create visualizers using the returned data.

stephenb · September 8, 2023, 5:10am

I don't know what you mean by a visualizer... can you provide more details.

I showed a table ... what are you looking for?

carollyl · September 9, 2023, 1:13am

Hi @stephenb , I'd like to build 2 dimension visualizer like follows. You helped me with the aggregated results (duration with workflow*). Is it possible to build the visualizer with breakdown by y-axis (which is release*).

I'd like to clarify, is it necessary to ingest the aggregated results into ES again in order to build visualizer? I'm confused with the records of the screenshot below.

stephenb · September 9, 2023, 2:33am

Hi @carollyl

Those are just the field names for my sample data, had to make my own sample data based on what you provided.

If those are in separate indices that will not work .... if you can put them in the same index perhaps

So here is my test data set.
The first row is the field names
workflow, release,sid are all keywords
id, duration are long

release,id,workflow,sid,duration
1.0,1,A,x1,1
1.0,1,A,x2,6
1.0,2,A,x1,2
1.0,2,A,x2,3
1.0,1,B,y1,1
1.0,1,B,y2,2
1.0,2,B,y1,8
1.0,2,B,y2,3
1.0,3,B,y1,4
1.0,3,B,y2,2
1.1,1,A,x1,1
1.1,1,A,x2,6
1.1,2,A,x1,2
1.1,2,A,x2,4
1.2,1,C,y1,3
1.2,1,C,y2,2
1.2,2,C,y1,8
1.2,2,C,y2,3
1.2,4,C,y1,9
1.2,4,C,y2,7

You can create a table by just adding the Release Row as the first row in the Rows fields

Unfortunatly the Heat Map does not work because you can not apply the max id selection (I assume that requirement still exists) because that level of filtering is not available on that Visuzilaton.

If you want to use the heat map you would need to pre-filter / aggregate the data.

carollyl · September 9, 2023, 4:17am

Is it possible to use Vega to address my use case? And yes, the max id criteria still hold.

stephenb · September 9, 2023, 6:09am

Yes probably, but I can not help you with Vega

You might look at using a transform.

Think pivot table

Topic		Replies	Views
Sum of a field after terms aggregation Elasticsearch	5	237	March 21, 2023
Using Top Hits aggregation for calculating time duration in each grouped field Elasticsearch	5	1255	August 29, 2017
From Elasticsearch aggregations to Kibana dashboard Kibana	2	528	April 7, 2022
[Kibana visualization] max of sum Kibana	4	1432	December 2, 2019
Sum up Aggregation values? Elasticsearch	4	526	July 6, 2017

id	workflow	sid	duration
1	A	x1	1m
1	A	x2	2m
2	A	x1	2m
2	A	x2	3m
1	B	y1	1m
1	B	y2	2m
2	B	y1	2m
2	B	y2	3m
3	B	y1	4m
3	B	y2	2m

id	workflow	sid	duration
1	A	x1	1m
1	A	x2	2m
2	A	x1	2m
2	A	x2	3m
1	B	y1	1m
1	B	y2	2m
2	B	y1	2m
2	B	y2	3m
3	B	y1	4m
3	B	y2	2m

Sum of duration field of max per group in Elasticsearch

Related topics

id	workflow	sid	duration
1	A	x1	1m
1	A	x2	2m
2	A	x1	2m
2	A	x2	3m
1	B	y1	1m
1	B	y2	2m
2	B	y1	2m
2	B	y2	3m
3	B	y1	4m
3	B	y2	2m