Need help with forecasting based on counts in a specific interval - Math guys, let's do this

Posted in the Kibana forum, but was pointed here. For my particular need, I'm attempting to do interval forecasting based on counts. I.e., we look through the logs for a certain LogEntryPhrase which, if present, signifies that a certain popup was successful. I want to know how many popups occurred in the last 15 minutes, and then compare that to the same interval one week ago, two weeks ago, and so on.

So for Monday from 9:00 - 9:15, how does that compare to last Monday, and the Monday before that, etc.? Think call center forecasting based on counts. Unfortunately, Kibana can't do this with date histograms, which rules out a lot of the built-in smoothing / prediction methods (ewma/holt_winters), since my interval needs to be a specific time frame on various days.

I.e., counts from 9:00 - 9:15 aren't very indicative of the expected counts from 9:15 - 9:30, but the 9:15 - 9:30 counts from the last few Mondays are.

Below is my current hack method of doing it, but again, I miss out on a ton of the necessary smoothing techniques. (How do I handle missing data / anomalies?)

Any advice, or workarounds that someone has found that work for them?

{
  "size": 0,
  "query": {
    "bool": {
      "filter": {
        "range": {
          "@timestamp": {
            "gte": "now-10w"
          }
        }
      },
      "must": [Phrases I'm looking for are here ]
    }
  },
  "aggs": {
    "history": {
      "date_range": {
        "field": "@timestamp",
        "ranges": [
          { "from": "now-5w-15m", "to": "now-5w" },
          { "from": "now-4w-15m", "to": "now-4w" },
          { "from": "now-3w-15m", "to": "now-3w" },
          { "from": "now-2w-15m", "to": "now-2w" },
          { "from": "now-1w-15m", "to": "now-1w" }
        ]
      },
      "aggs": {
        "my_count": {
          "sum": {
            "script": "1"
          }
        }
      }
    },
    "my_stats": {
      "extended_stats_bucket": {
        "buckets_path": "history>my_count"
      }
    },
    "today": {
      "date_range": {
        "field": "@timestamp",
        "ranges": [
          { "from": "now-15m", "to": "now" }
        ]
      }
    }
  }
}

This gives me the output I need. It also gets incredibly tedious once we increase how much history we keep.

For example, the "today" results:

"today": {
  "buckets": [
    {
      "key": "2017-03-02T15:10:38.439Z-2017-03-02T15:25:38.439Z",
      "from": 1488467438439,
      "from_as_string": "2017-03-02T15:10:38.439Z",
      "to": 1488468338439,
      "to_as_string": "2017-03-02T15:25:38.439Z",
      "doc_count": 23
    }
  ]
}

And then we get a few buckets that look like this, with their aggregated stats below:
{
  "key": "2017-02-23T15:10:38.439Z-2017-02-23T15:25:38.439Z",
  "from": 1487862638439,
  "from_as_string": "2017-02-23T15:10:38.439Z",
  "to": 1487863538439,
  "to_as_string": "2017-02-23T15:25:38.439Z",
  "doc_count": 28,
  "my_count": {
    "value": 28
  }
}

"my_stats": {
  "count": 5,
  "min": 19,
  "max": 28,
  "avg": 24.4,
  "sum": 122,
  "sum_of_squares": 3023,
  "variance": 9.239999999999963,
  "std_deviation": 3.039736830714127,
  "std_deviation_bounds": {
    "upper": 24.4,
    "lower": 24.4
  }
}
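One small aside on the query itself: the my_count metric (a sum of the constant script "1") exists only to give extended_stats_bucket a numeric buckets_path, and it just reproduces doc_count. Assuming every document has an @timestamp, a scriptless value_count does the same job:

```json
"my_count": {
  "value_count": {
    "field": "@timestamp"
  }
}
```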

But again, I really am missing out on smoothing techniques that handle missing/anomalies. Any ideas?

You could try using the Serial Differencing pipeline agg:

Serial differencing is a technique where values in a time series are subtracted from themselves at different time lags or periods. For example, the differenced series is f'(t) = f(t) - f(t - n), where n is the lag (period) being used.

So in that setup, you do a date_histogram with 15-minute intervals over the entire stretch of time (like a month), and calculate a metric for each 15-minute bucket.

Then you use a serial difference to subtract each bucket from the corresponding bucket one week earlier (a lag of one week's worth of buckets), which spits out the difference as a new metric. Then finally you could use the Moving Average pipeline to smooth out those differences, or something similar.
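That setup might look roughly like this (a sketch, assuming the serial_diff and moving_avg pipeline aggregations; the agg names and window are illustrative, and the lag of 672 is one week expressed as 7 days x 96 fifteen-minute buckets):

```json
{
  "size": 0,
  "aggs": {
    "histo": {
      "date_histogram": {
        "field": "@timestamp",
        "interval": "15m"
      },
      "aggs": {
        "popup_count": {
          "value_count": { "field": "@timestamp" }
        },
        "week_over_week": {
          "serial_diff": {
            "buckets_path": "popup_count",
            "lag": 672
          }
        },
        "smoothed": {
          "moving_avg": {
            "buckets_path": "week_over_week",
            "window": 8,
            "gap_policy": "insert_zeros"
          }
        }
      }
    }
  }
}
```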

Pipelines sorta have tolerance for gappy data, depending on what you need. If the gaps are shorter than 15 minutes you won't notice, as they'll be hidden by the bucketing granularity. But if the gaps are longer (like an entire hour missing), you can set the gap_policy to insert_zeros, which will add zero-valued buckets for the gaps.

We don't, unfortunately, have an interpolation gap policy yet, which would be ideal in a situation like this.

Thanks for the response. It could work, but I need a way to subtract each bucket from the average of the buckets (n-7, n-14, n-21, n-28, etc.). Is this possible?

I need to take the same interval over the last x weeks (i.e., the last x Mondays from 9:15 - 9:30), average those counts together to get a rough estimate of what to expect in that time interval for that particular day, and then compare the current day to that.
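For what it's worth, once the per-week counts for a slot have been pulled out of the history buckets, this averaging (and some anomaly tolerance) can be done client-side. A minimal Python sketch, where the counts, weighting scheme, and function names are illustrative rather than part of the query above; missing weeks are passed as None and skipped, and an exponentially weighted average favors recent weeks:

```python
def weekly_baseline(history_counts, alpha=0.5):
    """EWMA baseline over the same 15-minute slot on prior weeks.

    history_counts: counts oldest-first (e.g. n-5w ... n-1w);
    None marks a week with missing data and is simply skipped.
    """
    baseline = None
    for count in history_counts:
        if count is None:          # gap: keep the previous estimate
            continue
        if baseline is None:       # first observed week seeds the EWMA
            baseline = float(count)
        else:                      # recent weeks weighted more heavily
            baseline = alpha * count + (1 - alpha) * baseline
    return baseline

def deviation(today_count, history_counts, alpha=0.5):
    """How far today's slot count sits from its weekly baseline."""
    base = weekly_baseline(history_counts, alpha)
    return None if base is None else today_count - base

# Hypothetical counts for the same slot over five prior weeks:
print(weekly_baseline([19, 25, 22, 28, 28]))   # -> 26.5
print(deviation(23, [19, 25, 22, 28, 28]))     # -> -3.5
```

With alpha=0.5 each week's count counts twice as much as the week before it; setting alpha closer to 0 approaches a plain mean of the observed weeks.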

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.