Elasticsearch: 7.5
Kibana: 7.5
This is regarding the Machine Learning feature, specifically how the "Typical" and "Actual" values are calculated. Before posting this topic I read the following:
- https://www.elastic.co/blog/machine-learning-anomaly-scoring-elasticsearch-how-it-works
- https://www.elastic.co/blog/changes-to-elastic-machine-learning-anomaly-scoring-in-6-5
- https://www.elastic.co/guide/en/machine-learning/7.5/xpack-ml.html
- I also searched the discussion forum
My ultimate goal is to set up near real-time anomaly detection for attacks on public endpoints so that I can create alerts and react to potential incidents right away. We have a system called "workhorse" which logs, among many other things, the URI endpoints that users hit. This field is called json.uri. Let's say it takes values such as /api/ABC, /api/XYZ, etc. I would like to know when a single endpoint gets attacked (by an indefinite number of remote IPs).
I have tried creating a few different jobs in Machine Learning:
- Single metric job
- Multi metric job
- Population
- Advanced job
One example is an advanced job (advanced because I needed to exclude certain json.uri endpoints, such as the ones we use for healthchecks). Configuration:
"job_id": "uri-based-anomaly-detection",
"job_type": "anomaly_detector",
"job_version": "7.5.1",
"description": "",
"create_time": 1582470613404,
"analysis_config": {
  "bucket_span": "15m",
  "detectors": [
    {
      "detector_description": "high_count over \"json.uri\"",
      "function": "high_count",
      "over_field_name": "json.uri",
      "detector_index": 0
    }
  ],
  "influencers": [
    "json.uri"
  ]
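For context, the exclusion of healthcheck endpoints mentioned above could be expressed in the datafeed query roughly as follows. This is a minimal sketch, not the actual datafeed of this job: the URI values in EXCLUDED_URIS are hypothetical placeholders, the helper name build_datafeed_query is made up for illustration, and it assumes json.uri is mapped as a keyword so that a terms query matches exact values.

```python
import json

# Hypothetical placeholder values; the real healthcheck endpoints are not listed in this post.
EXCLUDED_URIS = ["/healthcheck", "/ping"]


def build_datafeed_query(excluded_uris):
    """Return a bool query that matches every document except the excluded URIs."""
    return {
        "bool": {
            "must": [{"match_all": {}}],
            # Assumes json.uri is a keyword field, so terms does exact matching.
            "must_not": [{"terms": {"json.uri": list(excluded_uris)}}],
        }
    }


if __name__ == "__main__":
    # Print the query so it can be pasted into the datafeed's "query" section.
    print(json.dumps(build_datafeed_query(EXCLUDED_URIS), indent=2))
```

The resulting object would go under the "query" key of the datafeed configuration, so the excluded endpoints never reach the detector.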
The Challenge | Typical vs Actual
As soon as I started running the job, it flagged one endpoint, /api/*****, as an anomaly and gave it critical and major scores. Looking at the result, the job showed a "Typical" value of about 1.81 but an "Actual" value of some 30-40K. See screenshot below:
This endpoint did not get attacked and there should not be an anomaly. If I search for this specific endpoint in Kibana, I can see that there is no attack whatsoever:
And if I zoom out to a 3-day graph, you can clearly see that traffic goes up during busier hours and goes down at night:
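The same check can be reproduced outside of the Kibana graphs by counting requests for one endpoint per 15-minute bucket, matching the job's bucket_span. The sketch below only builds the search body. It assumes the time field is called @timestamp (not stated in this post), uses /api/ABC as a stand-in for the masked endpoint, and the helper name build_bucket_count_query is made up for illustration.

```python
import json


def build_bucket_count_query(uri, time_field="@timestamp", interval="15m", lookback="now-3d"):
    """Build a search body that counts documents for one URI per time bucket."""
    return {
        "size": 0,  # only the aggregation is needed, not the hits
        "query": {
            "bool": {
                "filter": [
                    {"term": {"json.uri": uri}},
                    {"range": {time_field: {"gte": lookback}}},
                ]
            }
        },
        "aggs": {
            "per_bucket": {
                # fixed_interval is the date_histogram parameter available from 7.2 onwards
                "date_histogram": {"field": time_field, "fixed_interval": interval}
            }
        },
    }


if __name__ == "__main__":
    # "/api/ABC" is a stand-in; the real endpoint is masked in this post.
    print(json.dumps(build_bucket_count_query("/api/ABC"), indent=2))
```

Each bucket's doc_count in the response should be comparable to the "Actual" value the job reports for that endpoint in the same 15-minute window.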
Questions
So while I think I might be getting the hang of the Machine Learning feature, there are still some ambiguities that I am not sure how to resolve.
- The specific json.uri value in the example above gets hit around 30-40K times anyway, so why would the "Typical" be so low, i.e. 1.81? I believe this low "Typical" is what is driving the high anomaly score. Is it because the job is comparing this specific endpoint to the other endpoints, or because it is not scanning enough data, or something else? (This specific service receives about 500K requests per minute and all of that data is stored in Elasticsearch, so it should not be a problem of sparse data.)
- In order to do near real-time anomaly detection, do I first need to create a job and let it analyze enough past data to learn? Or is it enough just to configure a job and let it run in real time? If it is the latter, how does it calculate the "actual" value?
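To make the first question concrete, here is a toy sketch of the hypothesis that the job compares each endpoint to the other endpoints (which is what over_field_name asks for, since it configures a population analysis). All counts and the extra endpoint names are made up, and the median is used purely as a stand-in for "typical". This is not how Elastic ML actually models the population, only an illustration of why one busy endpoint could look extreme next to many quiet ones.

```python
from statistics import median

# Made-up request counts per json.uri for a single 15m bucket.
bucket_counts = {
    "/api/ABC": 35000,  # the one busy endpoint
    "/api/XYZ": 2,
    "/api/one": 1,      # hypothetical quiet endpoints
    "/api/two": 3,
    "/api/three": 1,
}


def population_typical(counts):
    """Stand-in for 'typical': the median count across all members of the population."""
    return median(counts.values())


def ratio_to_population(counts, member):
    """How many times larger one member's count is than the population's stand-in typical."""
    return counts[member] / population_typical(counts)


if __name__ == "__main__":
    print("typical (median):", population_typical(bucket_counts))
    print("ratio for /api/ABC:", ratio_to_population(bucket_counts, "/api/ABC"))
```

If this hypothesis is right, a busy endpoint would be flagged in every bucket simply for being busier than its peers, regardless of its own history.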