Real-time anomaly score analysis

I'm looking to use X-Pack's anomaly detection to check whether a document I'm creating would be considered an anomaly. Is it possible to get the anomaly score for a document almost immediately after storing it? I'm using https://www.elastic.co/blog/alerting-on-machine-learning-jobs-in-elasticsearch-v55 as a guide. Thanks in advance for any pointers.

Hi Stephen,

ML anomaly detection aggregates the data into time buckets. Thus, anomaly results are created for a bucket, not for each separate raw document.

For example, we could be trying to detect anomalies in the total bytes written to the network by a machine. Each document may represent a network packet, and we may have many of those per second. It is very beneficial to bucket those measurements together over a suitable bucket_span (e.g. 1 minute, 5 minutes, etc., depending on the data and the use case) for a number of reasons. For this example, we might use a sum function, which means we add all those bytes together. At the end of the bucket, we have a single measurement: the total number of bytes sent over the network during that time bucket. That is the value we model and the value we perform anomaly detection on. Of course, we keep separate measurements/models per time series.
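To make that concrete, a minimal sketch of such a job configuration is below. The job id ("traffic-bytes"), the field names ("bytes", "host", "@timestamp"), the 5-minute bucket_span and the cluster address/credentials are all placeholders for illustration, not taken from your data, and the exact URL prefix can vary by version:

```python
import requests

ES = "http://localhost:9200"      # placeholder cluster address
AUTH = ("elastic", "changeme")    # placeholder credentials

# Minimal ML job: sum the "bytes" field into 5-minute buckets,
# modelling each host ("partition_field_name") as its own time series.
job = {
    "description": "Total bytes written per host (illustrative)",
    "analysis_config": {
        "bucket_span": "5m",
        "detectors": [
            {
                "function": "sum",
                "field_name": "bytes",
                "partition_field_name": "host"
            }
        ]
    },
    "data_description": {
        "time_field": "@timestamp"
    }
}

resp = requests.put(f"{ES}/_xpack/ml/anomaly_detectors/traffic-bytes",
                    json=job, auth=AUTH)
resp.raise_for_status()
print(resp.json())
```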

Results are normally calculated after a bucket is complete. However, there is functionality to calculate interim results using the Flush API.
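If it helps, here is a minimal sketch of asking a job for interim results via the Flush API, using the same placeholder job id, cluster address and credentials as above:

```python
import requests

ES = "http://localhost:9200"      # placeholder cluster address
AUTH = ("elastic", "changeme")    # placeholder credentials

# Ask the job to calculate interim results for the bucket that is
# currently open, instead of waiting for the bucket to complete.
resp = requests.post(f"{ES}/_xpack/ml/anomaly_detectors/traffic-bytes/_flush",
                     json={"calc_interim": True}, auth=AUTH)
resp.raise_for_status()
print(resp.json())
```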

Having explained a bit how the analysis works, does this cover your question? If not, could you please be more specific about your use case?

Thank you dmitri for your thoughtful response.

In my use case, I'm looking to create a document - representing an incoming request - and immediately** determine whether to block or accept the request, based on how likely it is that the request is an anomaly relative to normal traffic patterns. I'm okay with false positives, since I give users a way to work around being blocked incorrectly.

** My definition of "immediately" is somewhat loose - I can tolerate a delay of 0-2 seconds.

If I were to run the Flush API for each new document I create (which happens hundreds of times per second), am I barking up the wrong tree? I assume that's too expensive an operation to run on every new document.

Would this alternative be any better? (There's a rough sketch of what I mean after the list below.)

  1. For the document field in question, use similarity scoring to rank existing documents by how closely they match the new one
  2. Run the Flush API on an occasional interval (once per minute, for example)
  3. When a new document is created, search for and retrieve a document with the most field similarity (one that is old enough to have been "flushed"/processed for anomaly detection).
  4. If the older, retrieved document was considered an anomaly, then consider this new document an anomaly as well.
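Roughly, what I have in mind looks like this. The index name ("incoming-requests"), field name ("body"), job id ("traffic-bytes"), score threshold and the two-minute "old enough" cutoff are all placeholders/guesses on my part; the bucket lookup is in the same style as the queries in the alerting blog post I linked:

```python
import requests
from datetime import datetime, timedelta, timezone

ES = "http://localhost:9200"      # placeholder cluster address
AUTH = ("elastic", "changeme")    # placeholder credentials

def looks_anomalous(new_doc_body, threshold=75):
    """Decide whether an incoming request resembles traffic ML scored as anomalous."""
    # 1. Find the most similar *older* request (old enough to have been
    #    covered by a flushed/processed bucket) using more_like_this.
    cutoff = (datetime.now(timezone.utc) - timedelta(minutes=2)).isoformat()
    similar = requests.post(f"{ES}/incoming-requests/_search", auth=AUTH, json={
        "size": 1,
        "query": {
            "bool": {
                "must": {
                    "more_like_this": {
                        "fields": ["body"],          # placeholder field
                        "like": new_doc_body,
                        "min_term_freq": 1,
                        "min_doc_freq": 1
                    }
                },
                "filter": {"range": {"@timestamp": {"lte": cutoff}}}
            }
        }
    }).json()
    hits = similar["hits"]["hits"]
    if not hits:
        return False
    ts = hits[0]["_source"]["@timestamp"]

    # 2. Look up the latest bucket result starting at or before that
    #    document's timestamp and check its anomaly score.
    buckets = requests.post(f"{ES}/.ml-anomalies-*/_search", auth=AUTH, json={
        "size": 1,
        "query": {
            "bool": {
                "filter": [
                    {"term": {"job_id": "traffic-bytes"}},   # placeholder job id
                    {"term": {"result_type": "bucket"}},
                    {"range": {"timestamp": {"lte": ts}}}
                ]
            }
        },
        "sort": [{"timestamp": "desc"}]
    }).json()
    bucket_hits = buckets["hits"]["hits"]
    return bool(bucket_hits) and bucket_hits[0]["_source"]["anomaly_score"] >= threshold
```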
