Real-time anomaly score analysis

I'm looking to use X-Pack's anomaly detection to check whether a document I'm creating would be considered an anomaly. Is it possible to get the anomaly score for a document almost immediately after storing it? I'm using https://www.elastic.co/blog/alerting-on-machine-learning-jobs-in-elasticsearch-v55 as a guide. Thanks in advance for any pointers.

Hi Stephen,

ML anomaly detection aggregates the data into time buckets. Thus, anomaly results are created for a bucket, not for each separate raw document.

For example, we could be trying to detect anomalies in the total bytes written to the network by a machine. Each document may represent a network packet, and we may have many of those per second. It is very beneficial to bucket those measurements together over a suitable bucket_span (e.g. 1 minute, 5 minutes, etc., depending on the data and the use case) for a number of reasons. For this example, we might use a sum function, which means we add all those bytes together. At the end of the bucket, we have a single measurement: the total number of bytes sent over the network during that time bucket. That is the value we model and the value we perform anomaly detection on. Of course, we keep separate measurements/models per time series.
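To make that concrete, a minimal sketch of such a job configuration is below. The job id ("traffic-bytes"), the field names ("bytes", "host", "@timestamp"), the 5-minute bucket_span and the cluster address/credentials are all placeholders for illustration, not taken from your data, and the exact URL prefix can vary by version:

```python
import requests

ES = "http://localhost:9200"      # placeholder cluster address
AUTH = ("elastic", "changeme")    # placeholder credentials

# Minimal ML job: sum the "bytes" field into 5-minute buckets,
# modelling each host ("partition_field_name") as its own time series.
job = {
    "description": "Total bytes written per host (illustrative)",
    "analysis_config": {
        "bucket_span": "5m",
        "detectors": [
            {
                "function": "sum",
                "field_name": "bytes",
                "partition_field_name": "host"
            }
        ]
    },
    "data_description": {
        "time_field": "@timestamp"
    }
}

resp = requests.put(f"{ES}/_xpack/ml/anomaly_detectors/traffic-bytes",
                    json=job, auth=AUTH)
resp.raise_for_status()
print(resp.json())
```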

Results are normally calculated after a bucket is complete. However, there is functionality to calculate interim results using the Flush API.
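If it helps, here is a minimal sketch of asking a job for interim results via the Flush API, using the same placeholder job id, cluster address and credentials as above:

```python
import requests

ES = "http://localhost:9200"      # placeholder cluster address
AUTH = ("elastic", "changeme")    # placeholder credentials

# Ask the job to calculate interim results for the bucket that is
# currently open, instead of waiting for the bucket to complete.
resp = requests.post(f"{ES}/_xpack/ml/anomaly_detectors/traffic-bytes/_flush",
                     json={"calc_interim": True}, auth=AUTH)
resp.raise_for_status()
print(resp.json())
```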

Having explained a bit how the analysis works, does this cover your question? If not, could you please be more specific about your use case?

Thank you dmitri for your thoughtful response.

In my use case, I'm looking to create a document - representing an incoming request - and immediately** determine whether to block or accept the request, based on how likely it is that the request is an anomaly relative to normal traffic patterns. I'm okay with false positives, since I give users a way to work around being blocked incorrectly.

** My definition of "immediately" is somewhat loose - I can tolerate a delay of 0-2 seconds.

If I were to run the Flush API for each new document I create (which happens hundreds of times per second), am I barking up the wrong tree? I assume that's too expensive an operation to run on every new document.

Would this alternative be any better? (There's a rough sketch of what I mean after the list below.)

  1. For the document field in question, use similarity scoring to rank existing documents by how closely they match the new one
  2. Run the Flush API on an occasional interval (once per minute, for example)
  3. When a new document is created, search for and retrieve a document with the most field similarity (one that is old enough to have been "flushed"/processed for anomaly detection).
  4. If the older, retrieved document was considered an anomaly, then consider this new document an anomaly as well.
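Roughly, what I have in mind looks like this. The index name ("incoming-requests"), field name ("body"), job id ("traffic-bytes"), score threshold and the two-minute "old enough" cutoff are all placeholders/guesses on my part; the bucket lookup is in the same style as the queries in the alerting blog post I linked:

```python
import requests
from datetime import datetime, timedelta, timezone

ES = "http://localhost:9200"      # placeholder cluster address
AUTH = ("elastic", "changeme")    # placeholder credentials

def looks_anomalous(new_doc_body, threshold=75):
    """Decide whether an incoming request resembles traffic ML scored as anomalous."""
    # 1. Find the most similar *older* request (old enough to have been
    #    covered by a flushed/processed bucket) using more_like_this.
    cutoff = (datetime.now(timezone.utc) - timedelta(minutes=2)).isoformat()
    similar = requests.post(f"{ES}/incoming-requests/_search", auth=AUTH, json={
        "size": 1,
        "query": {
            "bool": {
                "must": {
                    "more_like_this": {
                        "fields": ["body"],          # placeholder field
                        "like": new_doc_body,
                        "min_term_freq": 1,
                        "min_doc_freq": 1
                    }
                },
                "filter": {"range": {"@timestamp": {"lte": cutoff}}}
            }
        }
    }).json()
    hits = similar["hits"]["hits"]
    if not hits:
        return False
    ts = hits[0]["_source"]["@timestamp"]

    # 2. Look up the latest bucket result starting at or before that
    #    document's timestamp and check its anomaly score.
    buckets = requests.post(f"{ES}/.ml-anomalies-*/_search", auth=AUTH, json={
        "size": 1,
        "query": {
            "bool": {
                "filter": [
                    {"term": {"job_id": "traffic-bytes"}},   # placeholder job id
                    {"term": {"result_type": "bucket"}},
                    {"range": {"timestamp": {"lte": ts}}}
                ]
            }
        },
        "sort": [{"timestamp": "desc"}]
    }).json()
    bucket_hits = buckets["hits"]["hits"]
    return bool(bucket_hits) and bucket_hits[0]["_source"]["anomaly_score"] >= threshold
```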
