Machine Learning - Confused on Typical vs Actual Values

Elasticsearch: 7.5
Kibana: 7.5

This is regarding the Machine Learning feature and specifically around how the "Actual" and "Typical" scores are calculated. Before I posted this topic I read the following:

My ultimate goal is to create near real-time anomaly detection for public endpoint attacks so I can create alert(s) and react to potential incidents right away. We have a system called "workhorse" which logs, among many other things, the URI endpoints that users hit. This specific field is called json.uri. This endpoint field has various values such as /api/ABC, /api/XYZ, etc., and I would like to know when a single endpoint gets attacked (by an indefinite number of remote IPs).

I have tried creating a few different jobs in Machine Learning:

  1. Single metric job
  2. Multi metric job
  3. Population
  4. Advanced job

One example is an advanced job (I needed the advanced job because I had to exclude certain json.uri endpoints that we use for healthchecks, etc.). Configuration:

{
  "job_id": "uri-based-anomaly-detection",
  "job_type": "anomaly_detector",
  "job_version": "7.5.1",
  "description": "",
  "create_time": 1582470613404,
  "analysis_config": {
    "bucket_span": "15m",
    "detectors": [
      {
        "detector_description": "high_count over \"json.uri\"",
        "function": "high_count",
        "over_field_name": "json.uri",
        "detector_index": 0
      }
    ],
    "influencers": [
      "json.uri"
    ]
  }
}
The Challenge | Typical vs Actual

As soon as I started running the job, it started detecting an endpoint: /api/***** as anomaly and gave it critical and major scores. Upon looking at the result, the job showed the "Typical" value is something like 1.81 but the "Actual" value is some 30-40K. See screenshot below:

This endpoint did not get attacked and there should not be an anomaly. If I search for this specific endpoint in Kibana then I can see that there is no attack whatsoever:

And if I zoom out to a 3-Day graph you can clearly see that the traffic goes up during busier hours and goes down during night:

Questions

So while I think I might be getting the hang of the Machine Learning feature, there are also some ambiguities that I am not sure how to resolve.

  1. The specific json.uri value in the example above gets hit around 30-40K anyway, so why is the "Typical" so low, i.e. 1.81? I believe this "Typical" being so low is what is driving the high anomaly score. Is it because the job is comparing this specific endpoint to the other endpoints, or because it is not scanning enough data? (This specific service receives about 500K requests per minute and all of that data is stored in Elasticsearch, so it shouldn't be a problem of sparse data.)

  2. In order to do near real-time anomaly detection, do I first need to create a job and let it analyze enough data in the past to learn? Or is it enough just to configure a job and let it run in real-time? If it is the latter, how does it calculate the "actual" value?

You're getting results you don't expect because you're doing a Population analysis (comparing the API hit rate against all other APIs, not against its own history). The actual value is the count for that specific json.uri, and the typical is (for simplicity's sake) the average hit rate across all APIs.
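As a toy sketch of that distinction (hypothetical numbers; this is not the actual ML model, which fits a proper statistical distribution rather than taking a median):

```python
import statistics

# Hypothetical request counts per endpoint in one 15m bucket.
# One busy endpoint sits in a population dominated by quiet ones.
bucket_counts = {
    "/api/popular": 35000,  # legitimately busy endpoint
    "/api/a": 2,
    "/api/b": 1,
    "/api/c": 3,
    "/api/d": 1,
}

# Population analysis: "actual" is this entity's own count in the bucket...
actual = bucket_counts["/api/popular"]

# ...while "typical" is modeled from the whole population of entities,
# so it is dragged down by the many low-traffic endpoints.
typical = statistics.median(bucket_counts.values())

print(actual, typical)  # the busy endpoint looks wildly anomalous vs the population
```

This is why a normally busy endpoint can look "critical" under a population analysis even though its own traffic history is perfectly ordinary.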

See this blog: https://www.elastic.co/blog/temporal-vs-population-analysis-in-elastic-machine-learning

If you want to compare a specific API's hit rate against its own historical hit rate then you need to NOT use the over_field_name, but rather the partition_field_name or the by_field_name.
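For example, an illustrative detector fragment for a temporal analysis (same field names as your job above; this compares each json.uri only to its own history):

```json
"detectors": [
  {
    "detector_description": "high_count partitioned by \"json.uri\"",
    "function": "high_count",
    "partition_field_name": "json.uri"
  }
]
```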

Thanks @richcollier for the response.

When you pointed out the over_field_name I was surprised, because the majority of the time I was using the UI to create Multi Metric jobs and I played with the Population-based job only once. I went back and checked a few other test jobs I created and all of them ended up with over_field_name as well. I will see if I can reproduce this behavior and report back.

In the meantime, I have created an advanced job, specifically using the partition_field_name, and I am running it now. Will see how it goes. The blog post about Temporal vs Population analysis was helpful. However, according to the blog post, Population is preferred for high-cardinality fields, and in my case the unique values of the json.uri field easily exceed 100,000. But I also need to compare each unique json.uri's traffic to its own traffic history. So this tells me: 1) Population is not a good choice. 2) I should add more filters to reduce the cardinality and do a temporal analysis. Does this sound about right?
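For instance, one filter I am considering is a datafeed query that drops the healthcheck-style endpoints before the job ever sees them (the /health prefix here is a placeholder; the real paths depend on our services):

```json
"datafeed_config": {
  "query": {
    "bool": {
      "must_not": [
        { "prefix": { "json.uri": "/health" } }
      ]
    }
  }
}
```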

Ok - good to recognize that. If your json.uri field has high cardinality, then yes, using it as a partition_field_name will make the memory requirements for the job very demanding, and you will likely need to increase the analysis_limits.model_memory_limit setting to avoid the job going into a hard_limit. But still, 100,000+ is pretty high, so you might also want to ask yourself whether you care about all 100,000+ URIs equally, or whether some are more important than others (like the top 1000 most popular ones, for example).
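For example, raised in the job config like this (the 2gb figure is just an illustration; the right value depends on the actual cardinality you end up with after filtering):

```json
"analysis_limits": {
  "model_memory_limit": "2gb"
}
```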
