How to use cumulative sum correctly for a machine learning job?

Hi all,

I want to create an ML job which analyses the trend in a cumulative value. So, I created the following aggregation, which calculates the sum of a field every day and then calculates the cumulative sum with the values of the previous days.

The aggregation which I created looks like the following:
GET /<index>/_search
{
  "size": 0, 
  "aggs": {
    "entity": {
      "terms": {
        "field": "term.keyword",
        "size": 1
      },
      "aggs": {
        "daily_sum": {
          "date_histogram": {
            "field": "@timestamp",
            "calendar_interval": "day"
          },
          
          "aggs": {
            "cumulative_sum": {
              "cumulative_sum": {
                "buckets_path": "daily_sum"
              }
            },
            "daily_sum": {
              "sum": {
                "field": "numeric_field"
              }
            }
          }
        }
      }
    }
  }
}

Which outputs the following:

      "buckets": [
        {
          "key": "[per term.keyword] field value",
          "doc_count": 216,
          "daily_sum": {
            "buckets": [ 
             {
                "key_as_string": "2024-01-05T00:00:00.000Z",
                "key": 1704412800000,
                "doc_count": 9,
                "daily_sum": {
                  "value": 6420
                },
                "cumulative_sum": {
                  "value": 6420
                }
              },
              {
                "key_as_string": "2024-01-06T00:00:00.000Z",
                "key": 1704499200000,
                "doc_count": 1,
                "daily_sum": {
                  "value": 0
                },
                "cumulative_sum": {
                  "value": 6420
                }
              },
              {
                "key_as_string": "2024-01-07T00:00:00.000Z",
                "key": 1704585600000,
                "doc_count": 1,
                "daily_sum": {
                  "value": 0
                },
                "cumulative_sum": {
                  "value": 6420
                }
              },
              {
                "key_as_string": "2024-01-08T00:00:00.000Z",
                "key": 1704672000000,
                "doc_count": 12,
                "daily_sum": {
                  "value": 1922.75
                },
                "cumulative_sum": {
                  "value": 8342,75
                }
              }
            ]
          }
        }
      ]

As you can see, not all dates have documents with a value for the field, so 'daily_sum' is 0 and 'cumulative_sum' stays the same (as it should).

Now I want to use this data as input for an ML job, where the cumulative_sum field is used by the detector and key_as_string (or key) is used as the time field for the job.

To use this as input for an ML job it needs to go through a datafeed, but whenever I try to place this aggregation in one, Elasticsearch throws the following error:

Date histogram must have nested max aggregation for time_field [@timestamp]

Now I tried doing that, but on the dates where there is no value for 'daily_sum' there is also no value for max(@timestamp), as there are no records. But I still want the cumulative sum of that day as a value in the ML job, because right now the job has empty data points, the line visualization has interruptions, and the forecast is not representative either.
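For reference, the nested max aggregation the error asks for would look roughly like this (a sketch; the agg name 'max_timestamp' is my own choice, the other fields come from the query above):

```json
"aggs": {
  "daily_sum": {
    "date_histogram": {
      "field": "@timestamp",
      "calendar_interval": "day"
    },
    "aggs": {
      "max_timestamp": {
        "max": { "field": "@timestamp" }
      },
      "daily_sum": {
        "sum": { "field": "numeric_field" }
      },
      "cumulative_sum": {
        "cumulative_sum": { "buckets_path": "daily_sum" }
      }
    }
  }
}
```

In empty buckets max_timestamp comes back as null, which is exactly the problem described above.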

Now I have some ideas for how to fix this, but I am not sure how to accomplish them:

  1. Make the field 'key_as_string' the date field for the ML job and datafeed
  2. Configure the ML job so that, in case of no data, it uses the value of the last bucket
  3. Change the aggregation?

Can anyone help me fix this issue?

With kind regards,
Rick


Hi,

There are examples of configuring a datafeed with aggregations at Aggregating data for faster performance | Machine Learning in the Elastic Stack [8.12] | Elastic

Take a look at the example using the derivative aggregation. The derivative agg is a pipeline aggregation, the same as cumulative sum; I think you can adapt that example to use the cumulative_sum agg.


Hi David,

Thanks for your help. I did find that example as well, but it has the nested aggregation max(@timestamp) in it, and that is the part which is giving me problems. If no records are available in an interval of the date histogram, max(@timestamp) returns null.

The ML job uses that field as the date field, and because it is null, the bucket won't be taken into account, even though the value of cumulative_sum (which does have a value) is still relevant for the ML model.

Thanks,
Rick

There has to be an aggregation on @timestamp, otherwise the data that comes out of the aggregation and is sent to ML has no timestamp, and anomaly detection jobs need a time field.

If you have sparse data then perhaps consider increasing the bucket_span and the interval of the date histogram aggregation to avoid empty buckets. Those two time intervals should match as well. In other words, don't make your bucket_span 30 mins and your date histogram interval 1 minute.
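To illustrate the point about matching intervals, the relevant pieces of configuration might look like this (pseudo-JSON fragments; '1d'/'day' are example values):

```json
// datafeed: date histogram interval
"date_histogram": { "field": "@timestamp", "calendar_interval": "day" }

// anomaly detection job: bucket span matching that interval
"analysis_config": { "bucket_span": "1d" }
```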


Thanks for your response Rich,

Yes, that is understandable. The aggregation does return a date field because of the date histogram, except that field is stored in the key and key_as_string fields, as shown in the output. I want to use those dates in the ML job, as that is the right interval. How can I achieve that?

Maybe @dkyle can give you a more technical reason why key or key_as_string field can't be used. Maybe it's because the value isn't guaranteed to be the last timestamp value in the bucket? That's what the documentation seems to indicate.

I can't see a way to use the histogram key as the date, at least not without a code change. The parsing code throws an error if there isn't a max agg on the time field.

This is the first example I've come across where you can have a valid value (the cumulative sum) in an empty bucket. I can't think of a workaround other than to create a dummy record with the timestamp field in every date histogram interval.
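For completeness, that dummy-record workaround would amount to indexing a zero-value document per term for each otherwise-empty interval, something like this (a sketch; the field names are taken from the query earlier in the thread):

```json
POST /<index>/_doc
{
  "@timestamp": "2024-01-06T00:00:00.000Z",
  "term": "some-term-value",
  "numeric_field": 0
}
```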

Hi David and Rich,

Thanks for your responses; that indeed seems to be the reason why it isn't possible to get this to work. I imagine this would also be a problem for the derivative aggregation.

So the goal I want to achieve is to be able to calculate the daily sum of a term and then be able to predict what the cumulative sum will be for the next 7/14 days.

Would there be another way to calculate this? I was thinking of pre-aggregating the data using a rollup or transform, but they don't support cumulative sum either. Now I am thinking of creating a watcher which has a time trigger, aggregates the data, and indexes the result. But after your last response I feel like that will run into the same issue.

Creating a dummy record won't be an option, I believe, as the number of terms, by the nature of the data, keeps growing, and we would need to keep adding new records to the 'dummy record creator'.

With kind regards,
Rick van de Vaart

For anyone with the same question:

For me the solution was creating a watcher which pre-aggregates the data and stores it in a separate index. The watcher is triggered by a schedule, the query and aggregation are executed, and the results are stored in an index using the index payload action. Using a transform script, the watcher creates a separate document for every bucket in the aggregation. The ML job then builds its model from that index.
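A rough sketch of such a watch, with assumed names (watch id, target index) and a simplified transform script, could look like this (the search body's aggs are elided; they are the aggregation from the first post):

```json
PUT _watcher/watch/cumulative-sum-preagg
{
  "trigger": {
    "schedule": { "daily": { "at": "00:30" } }
  },
  "input": {
    "search": {
      "request": {
        "indices": [ "<index>" ],
        "body": {
          "size": 0,
          "aggs": { }
        }
      }
    }
  },
  "transform": {
    "script": {
      "source": "def docs = []; for (e in ctx.payload.aggregations.entity.buckets) { for (b in e.daily_sum.buckets) { docs.add(['term': e.key, '@timestamp': b.key_as_string, 'cumulative_sum': b.cumulative_sum.value]); } } return ['_doc': docs];"
    }
  },
  "actions": {
    "store_results": {
      "index": { "index": "cumulative-sum-preagg" }
    }
  }
}
```

Returning a '_doc' array from the transform is what lets the index action write one document per aggregation bucket.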

Thanks for the update @Rick_V

Did you consider using a transform to pre-aggregate the data? A transform runs on a schedule, queries and aggregates data, then indexes it; this appears very similar to your Watcher solution. Was there a reason why you couldn't use a transform?

Thanks

Thanks for the suggestion @dkyle

Yeah, it is indeed the same idea as a transform. I considered both a rollup and a transform, but neither supports the cumulative sum metric, so those weren't an option unfortunately.

don't support the cumulative sum metric

Ah thank you.