Hi all,
I want to create a ml job which analyses the trend in a cumulative value. So, I created the following aggregation which calculates the sum of a field every day, and then calculates the cumulative sum with the values of previous days.
- The aggregation which I created looks like the following:
GET /<index>/_search
{
"size": 0,
"aggs": {
"entity": {
"terms": {
"field": "term.keyword",
"size": 1
},
"aggs": {
"daily_sum": {
"date_histogram": {
"field": "@timestamp",
"calendar_interval": "day"
},
"aggs": {
"cumulative_sum": {
"cumulative_sum": {
"buckets_path": "daily_sum"
}
},
"daily_sum": {
"sum": {
"field": "numeric_field"
}
}
}
}
}
}
}
}
Which outputs the following:
"buckets": [
{
"key": "[per term.keyword] field value",
"doc_count": 216,
"daily_sum": {
"buckets": [
{
"key_as_string": "2024-01-05T00:00:00.000Z",
"key": 1704412800000,
"doc_count": 9,
"daily_sum": {
"value": 6420
},
"cumulative_sum": {
"value": 6420
}
},
{
"key_as_string": "2024-01-06T00:00:00.000Z",
"key": 1704499200000,
"doc_count": 1,
"daily_sum": {
"value": 0
},
"cumulative_sum": {
"value": 6420
}
},
{
"key_as_string": "2024-01-07T00:00:00.000Z",
"key": 1704585600000,
"doc_count": 1,
"daily_sum": {
"value": 0
},
"cumulative_sum": {
"value": 6420
}
},
{
"key_as_string": "2024-01-08T00:00:00.000Z",
"key": 1704672000000,
"doc_count": 12,
"daily_sum": {
"value": 1922.75
},
"cumulative_sum": {
"value": 8342,75
}
}
]
}
}
As you can see, not all dates have a value for the field so the field 'daily_sum' is empty and the field cumulative_sum stays the same (As it should)
Now I want to use this data as input for a ML job where the field cumulative sum is used by the detector and the field (or the value) of key_as_string is used as date field for the time field for the ml job.
To use this as input for a ml job it needs to be a datafeed, but whenever I try to place this aggregation in it, elastic throws the following error:
Date histogram must have nested max aggregation for time_field [@timestamp]
Now I tried doing that, but if I do so, on the dates that there is no value for 'daily_sum' there is also no value for max(@timestamp) as there are no records. But still I want to have the cumulative sum of that day as a value in the ML job. Because now the ML job has empty data points and the line visualization has interruptions making the forecast also not representable.
Now I have some ideas how to fix, but I am not sure how I can accomplish them:
- Make the field 'key_as_string' the date field for the ML job and data feed
- Give the ML job the configuration in case of no data, use the value of the last bucket.
- Change the aggregation?
Can anyone help me how to fix this issue?
With kind regards,
Rick