How to use cumulative sum correctly for a machine learning job?

Hi all,

I want to create an ML job which analyses the trend in a cumulative value. So, I created the following aggregation, which calculates the sum of a field every day and then calculates the cumulative sum with the values of the previous days.

The aggregation which I created looks like the following:
GET /<index>/_search
{
  "size": 0, 
  "aggs": {
    "entity": {
      "terms": {
        "field": "term.keyword",
        "size": 1
      },
      "aggs": {
        "daily_sum": {
          "date_histogram": {
            "field": "@timestamp",
            "calendar_interval": "day"
          },
          
          "aggs": {
            "cumulative_sum": {
              "cumulative_sum": {
                "buckets_path": "daily_sum"
              }
            },
            "daily_sum": {
              "sum": {
                "field": "numeric_field"
              }
            }
          }
        }
      }
    }
  }
}

Which outputs the following:

      "buckets": [
        {
          "key": "[per term.keyword] field value",
          "doc_count": 216,
          "daily_sum": {
            "buckets": [ 
             {
                "key_as_string": "2024-01-05T00:00:00.000Z",
                "key": 1704412800000,
                "doc_count": 9,
                "daily_sum": {
                  "value": 6420
                },
                "cumulative_sum": {
                  "value": 6420
                }
              },
              {
                "key_as_string": "2024-01-06T00:00:00.000Z",
                "key": 1704499200000,
                "doc_count": 1,
                "daily_sum": {
                  "value": 0
                },
                "cumulative_sum": {
                  "value": 6420
                }
              },
              {
                "key_as_string": "2024-01-07T00:00:00.000Z",
                "key": 1704585600000,
                "doc_count": 1,
                "daily_sum": {
                  "value": 0
                },
                "cumulative_sum": {
                  "value": 6420
                }
              },
              {
                "key_as_string": "2024-01-08T00:00:00.000Z",
                "key": 1704672000000,
                "doc_count": 12,
                "daily_sum": {
                  "value": 1922.75
                },
                "cumulative_sum": {
                  "value": 8342,75
                }
              }
            ]
          }
        }
      ]

As you can see, not all dates have documents with a value for the field, so 'daily_sum' is 0 and 'cumulative_sum' stays the same (as it should).

Now I want to use this data as input for an ML job, where the cumulative_sum field is used by the detector and key_as_string (or key) is used as the time field for the job.

To use this as input for an ML job it needs to go through a datafeed, but whenever I try to place this aggregation in one, Elasticsearch throws the following error:

Date histogram must have nested max aggregation for time_field [@timestamp]

Now I tried doing that, but on the dates where there is no value for 'daily_sum' there is also no value for max(@timestamp), as there are no records. But I still want the cumulative sum of that day as a value in the ML job, because right now the job has empty data points, the line visualization has interruptions, and the forecast is not representative either.
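For reference, the nested max aggregation the error asks for would look roughly like this (a sketch; the agg name 'max_timestamp' is my own choice, the other fields come from the query above):

```json
"aggs": {
  "daily_sum": {
    "date_histogram": {
      "field": "@timestamp",
      "calendar_interval": "day"
    },
    "aggs": {
      "max_timestamp": {
        "max": { "field": "@timestamp" }
      },
      "daily_sum": {
        "sum": { "field": "numeric_field" }
      },
      "cumulative_sum": {
        "cumulative_sum": { "buckets_path": "daily_sum" }
      }
    }
  }
}
```

In empty buckets max_timestamp comes back as null, which is exactly the problem described above.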

Now I have some ideas for how to fix this, but I am not sure how to accomplish them:

  1. Make the field 'key_as_string' the date field for the ML job and datafeed
  2. Configure the ML job so that, in case of no data, it uses the value of the last bucket
  3. Change the aggregation?

Can anyone help me fix this issue?

With kind regards,
Rick


Hi,

There are examples of configuring a datafeed with aggregations at Aggregating data for faster performance | Machine Learning in the Elastic Stack [8.12] | Elastic

Take a look at the example using the derivative aggregation. The derivative agg is a pipeline aggregation, the same as cumulative sum; I think you can adapt that example to use the cumulative_sum agg.


Hi David,

Thanks for your help. I did find that example as well, but it has the nested aggregation max(@timestamp) in it, and that is the part which is giving me problems. If no records are available in an interval of the date histogram, max(@timestamp) returns null.

The ML job uses that field as the date field, and because it is null, the bucket won't be taken into account, even though the value of cumulative_sum (which does have a value) is still relevant for the ML model.

Thanks,
Rick

There has to be an aggregation on @timestamp, otherwise the data that comes out of the aggregation and is sent to ML has no timestamp, and anomaly detection jobs need a time field.

If you have sparse data then perhaps consider increasing the bucket_span and the interval of the date histogram aggregation to avoid empty buckets. Those two time intervals should match as well. In other words, don't make your bucket_span 30 mins and your date histogram interval 1 minute.
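To illustrate the point about matching intervals, the relevant pieces of configuration might look like this (pseudo-JSON fragments; '1d'/'day' are example values):

```json
// datafeed: date histogram interval
"date_histogram": { "field": "@timestamp", "calendar_interval": "day" }

// anomaly detection job: bucket span matching that interval
"analysis_config": { "bucket_span": "1d" }
```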


Thanks for your response Rich,

Yes, that is understandable. The aggregation does return a date field because of the date histogram, except that field is stored in the key and key_as_string fields, as shown in the output. I want to use those dates in the ML job, as that is the right interval. How can I achieve that?

Maybe @dkyle can give you a more technical reason why key or key_as_string field can't be used. Maybe it's because the value isn't guaranteed to be the last timestamp value in the bucket? That's what the documentation seems to indicate.

I can't see a way to use the histogram key as the date, at least not without a code change. The parsing code throws an error if there isn't a max agg on the time field.

This is the first example I've come across where you can have a valid value (the cumulative sum) in an empty bucket. I can't think of a workaround other than to create a dummy record with the timestamp field in every date histogram interval.
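For completeness, that dummy-record workaround would amount to indexing a zero-value document per term for each otherwise-empty interval, something like this (a sketch; the field names are taken from the query earlier in the thread):

```json
POST /<index>/_doc
{
  "@timestamp": "2024-01-06T00:00:00.000Z",
  "term": "some-term-value",
  "numeric_field": 0
}
```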

Hi David and Rich,

Thanks for your responses; that indeed seems to be the reason why it isn't possible to get this to work. I imagine this would also be a problem for the derivative aggregation.

So the goal I want to achieve is to be able to calculate the daily sum of a term and then be able to predict what the cumulative sum will be for the next 7/14 days.

Would there be another way to calculate this? I was thinking of pre-aggregating the data using a rollup or transform, but they don't support cumulative sum either. Now I am thinking of creating a watcher which has a time trigger, aggregates the data, and indexes the result. But after your last response I feel like that will run into the same issue.

Creating a dummy record won't be an option, I believe, as the number of terms, by the nature of the data, keeps growing, and we would need to keep adding new records to the 'dummy record creator'.

With kind regards,
Rick van de Vaart

For anyone with the same question:

For me the solution was creating a watcher which pre-aggregates the data and stores it in a separate index. The watcher is triggered by a schedule, the query and aggregation are executed, and the results are stored in an index using the index payload action. Using a transform script, the watcher creates a separate document for every bucket in the aggregation. The ML job then builds its model from that index.
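A rough sketch of such a watch, with assumed names (watch id, target index) and a simplified transform script, could look like this (the search body's aggs are elided; they are the aggregation from the first post):

```json
PUT _watcher/watch/cumulative-sum-preagg
{
  "trigger": {
    "schedule": { "daily": { "at": "00:30" } }
  },
  "input": {
    "search": {
      "request": {
        "indices": [ "<index>" ],
        "body": {
          "size": 0,
          "aggs": { }
        }
      }
    }
  },
  "transform": {
    "script": {
      "source": "def docs = []; for (e in ctx.payload.aggregations.entity.buckets) { for (b in e.daily_sum.buckets) { docs.add(['term': e.key, '@timestamp': b.key_as_string, 'cumulative_sum': b.cumulative_sum.value]); } } return ['_doc': docs];"
    }
  },
  "actions": {
    "store_results": {
      "index": { "index": "cumulative-sum-preagg" }
    }
  }
}
```

Returning a '_doc' array from the transform is what lets the index action write one document per aggregation bucket.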

Thanks for the update @Rick_V

Did you consider using a transform to pre-aggregate the data? A transform runs on a schedule, queries and aggregates data, then indexes it; this appears very similar to your Watcher solution. Was there a reason why you couldn't use a transform?

Thanks

Thanks for the suggestion @dkyle

Yeah, it is indeed the same idea as a transform. I considered both a rollup and a transform, but neither supports the cumulative sum metric, so those weren't an option unfortunately.

don't support the cumulative sum metric

Ah thank you.