Machine Learning predictions are 30 minutes off, raising false positives

We are using machine learning to detect anomalies in the request rates on our API. One of our jobs analyses the event rate per IP for a specific service.
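
For reference, the job is essentially a single count detector partitioned by client IP. A minimal sketch of that kind of configuration using the Python Elasticsearch client (job id, bucket span, and field names are simplified placeholders, not our exact setup):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Sketch of an anomaly detection job that models the event rate per client IP.
# Job id, bucket span, and field names are illustrative.
es.ml.put_job(
    job_id="api-request-rate-per-ip",
    body={
        "description": "Event rate per IP for the API service",
        "analysis_config": {
            "bucket_span": "15m",
            "detectors": [
                {
                    "function": "count",
                    "partition_field_name": "clientip",
                    "detector_description": "count partitioned by clientip",
                }
            ],
            "influencers": ["clientip"],
        },
        "data_description": {"time_field": "@timestamp"},
    },
)
```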

It generally works fine, but has been raising false positives for a specific customer. That customer recently shifted their API usage by 30 minutes, e.g. instead of calling our API at 08:00 AM, they now call it at 08:30 AM. Somehow the ML job hasn't adapted to the new schedule yet, even though it now makes up the majority of the dataset.

In the screenshot above, you can see three sections:

  1. The first part has usage spikes at 8 AM, 12 PM, 4 PM, etc. (every 4 hours). No anomaly is reported; that's good.
  2. The second part, where the anomalies start, is where the customer began using our API at 8:30 AM instead of 8 AM.
  3. The third part, the forecast, still expects the spikes at 8 AM instead of 8:30 AM.

That job is otherwise working fine for all other customers. How can I fix it for this one?

And, more generally, how can I tell the ML job "yes, this is an anomaly" or "no, this is not an anomaly"? Finally, is there a way I could add notes on the anomaly timeline?

The reason this happens is that the anomalies generated by the periodic spikes prevent the new seasonal pattern from being learned for a long time, because we avoid learning too much from anomalies. The intention was that if the anomalies are periodic, we would notice this and adapt the seasonal pattern quickly, but there is a bug that occurs when the bucket span is relatively long with respect to the length of the repeat. I had, in fact, already fixed this bug in this pull request. At the moment, I expect the change to be released in 7.9.

What can you do to work around this? The best solution available at present is to:

  1. Skip results from this IP for the running job. This will stop any anomalies from being raised. See the discussion in this blog; note that you want to choose the skip result action. Eventually the job should learn the new periodic pattern, but this could take a long time. At that point it should be possible to remove the rule. In the meantime...
  2. To get anomaly detection for this customer, create a new job and filter out all data except their IP. Make sure to start this job (in lookback) after the time of the shift. A rough sketch of both steps is shown after this list.
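
A hedged sketch of both steps with the Python Elasticsearch client; the filter id, job and datafeed ids, index pattern, `clientip` field, IP address, and start timestamp are all placeholders to adapt to your own setup:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# --- Step 1: skip results for this customer's IP on the existing job. ---
# Put the customer's IP into an ML filter, then attach a custom rule with the
# "skip_result" action, scoped to that filter, to the running job's detector.
es.ml.put_filter(
    filter_id="shifted-customer-ip",
    body={"items": ["203.0.113.10"]},  # hypothetical customer IP
)
es.ml.update_job(
    job_id="api-request-rate-per-ip",
    body={
        "detectors": [
            {
                "detector_index": 0,
                "custom_rules": [
                    {
                        "actions": ["skip_result"],
                        "scope": {
                            "clientip": {
                                "filter_id": "shifted-customer-ip",
                                "filter_type": "include",
                            }
                        },
                    }
                ],
            }
        ]
    },
)

# --- Step 2: dedicated job for this customer, started after the shift. ---
es.ml.put_job(
    job_id="api-request-rate-shifted-customer",
    body={
        "analysis_config": {
            "bucket_span": "15m",
            "detectors": [{"function": "count"}],
        },
        "data_description": {"time_field": "@timestamp"},
    },
)
es.ml.put_datafeed(
    datafeed_id="datafeed-api-request-rate-shifted-customer",
    body={
        "job_id": "api-request-rate-shifted-customer",
        "indices": ["api-logs-*"],  # hypothetical index pattern
        "query": {"term": {"clientip": "203.0.113.10"}},  # only this customer
    },
)
es.ml.open_job(job_id="api-request-rate-shifted-customer")
# Start the lookback strictly after the usage shift, so this job never learns
# the old 8 AM pattern (the timestamp below is a placeholder).
es.ml.start_datafeed(
    datafeed_id="datafeed-api-request-rate-shifted-customer",
    body={"start": "2020-06-01T00:00:00Z"},
)
```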

And, more generally, how can I tell the ML job "yes, this is an anomaly" or "no, this is not an anomaly"?

This is possible via rules, as discussed. We are considering additional ways that a user can provide feedback. The sort of feedback you'd ideally give in this case, I think, is "relearn the model for this IP starting from after the shift occurs". However, in my opinion this sort of thing is a deficiency in change adaptation in the modelling, and we should make changes to fix such behaviour. Generally, this is pretty robust and should just work, but your case exposed an edge case.

Finally, is there a way I could add notes on the anomaly timeline?

The annotations functionality allows you to do this. See this blog for a discussion.


Thank you for this in-depth answer.

We are considering additional ways that a user can provide feedback.

I found myself wanting that feature quite a few times, and would certainly appreciate its addition. So far, it is the sole reason I might want to implement custom ML.

I will look into setting up a rule for this occurrence until 7.9 is released 🙂
