Although this idea has been around for a while, I've decided to document this technique because it recently came up in a discussion.
Problem statement: How to analyze the ratio of documents over time with Anomaly Detection? For example, analyze the ratio of 404s to overall traffic volume in web access logs over time.
Solution: First and foremost, one needs to understand a few concepts:
- Elasticsearch queries can implement aggregations, which are on-the-fly summarizations of the data. For this example, one of the relevant aggregation types is the bucket_script aggregation: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-pipeline-bucket-script-aggregation.html
- ML jobs can use aggregations as input. See the following docs: https://www.elastic.co/guide/en/machine-learning/7.6/ml-configuring-aggregation.html
For this specific example, the following would be done:
First, create the job definition (note: in all examples, replace _xpack/ml with _ml in versions 7.x and beyond):
PUT _xpack/ml/anomaly_detectors/web_ratio
{
  "description": "Ratio of 404s to Total traffic",
  "analysis_config": {
    "bucket_span": "1h",
    "detectors": [
      {
        "detector_description": "mean(ratio)",
        "function": "mean",
        "field_name": "ratio"
      }
    ],
    "summary_count_field_name": "doc_count"
  },
  "model_plot_config": {
    "enabled": true
  },
  "data_description": {
    "time_field": "@timestamp"
  }
}
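The version note above can be captured in a small helper when scripting these requests. This is only a sketch: the function name ml_endpoint and its arguments are hypothetical, and it assumes nothing beyond the rule stated above (use _xpack/ml before 7.x and _ml from 7.x onward).

```python
def ml_endpoint(major_version: int, suffix: str) -> str:
    """Build an ML API path for the given Elasticsearch major version.

    Hypothetical helper: encodes only the rule from the text, i.e.
    "_xpack/ml" before 7.x and "_ml" in 7.x and beyond.
    """
    prefix = "_ml" if major_version >= 7 else "_xpack/ml"
    return f"{prefix}/{suffix.lstrip('/')}"


if __name__ == "__main__":
    # Print the job path for a 6.x and a 7.x cluster.
    for version in (6, 7):
        print(version, ml_endpoint(version, "anomaly_detectors/web_ratio"))
```

The same helper would apply to the datafeed and preview paths used later in this post.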
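Next, define the datafeed (note that in Elasticsearch 7.2 and later, the date_histogram "interval" parameter is deprecated in favor of "fixed_interval" or "calendar_interval"):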
PUT _xpack/ml/datafeeds/datafeed-web_ratio/
{
  "job_id": "web_ratio",
  "indices": [
    "gallery-*"
  ],
  "aggregations": {
    "buckets": {
      "date_histogram": {
        "field": "@timestamp",
        "interval": "1h",
        "time_zone": "UTC"
      },
      "aggregations": {
        "@timestamp": {
          "max": {
            "field": "@timestamp"
          }
        },
        "total_count": {
          "value_count": {
            "field": "_index"
          }
        },
        "404s": {
          "filter": {
            "term": {
              "status": "404"
            }
          }
        },
        "ratio": {
          "bucket_script": {
            "buckets_path": {
              "total_count": "total_count.value",
              "four_oh_fours": "404s._count"
            },
            "script": "if(params.total_count>0){params.four_oh_fours / params.total_count} else{0}"
          }
        }
      }
    }
  }
}
Notice that the bucket_script calculates the ratio but guards against a divide-by-zero condition: when total_count is 0, it returns 0 instead of dividing.
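To make the guard concrete, here is a minimal Python sketch of the same logic, run outside Elasticsearch. The function name ratio_404 and the sample bucket counts are illustrative assumptions, not part of the job; only the if/else behavior mirrors the script above.

```python
def ratio_404(total_count: float, four_oh_fours: float) -> float:
    """Mirror the bucket_script: 404s / total, or 0 for an empty bucket."""
    if total_count > 0:
        return four_oh_fours / total_count
    return 0.0


if __name__ == "__main__":
    # Hypothetical (total_count, four_oh_fours) pairs, one per hourly bucket.
    sample_buckets = [(31, 0), (123, 1), (0, 0)]
    for total, errors in sample_buckets:
        print(total, errors, ratio_404(total, errors))
```

Returning 0 for an empty bucket is a design choice carried over from the script; it treats "no traffic" as "no 404s" rather than as missing data.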
We can test the datafeed with the _preview endpoint:
GET _xpack/ml/datafeeds/datafeed-web_ratio/_preview
which will yield something like:
{
  "@timestamp" : 1483347581000,
  "ratio" : 0.0,
  "doc_count" : 31
},
{
  "@timestamp" : 1483351122000,
  "ratio" : 0.0,
  "doc_count" : 21
},
{
  "@timestamp" : 1483354720000,
  "ratio" : 0.008130081300813009,
  "doc_count" : 123
},
{
  "@timestamp" : 1483358376000,
  "ratio" : 0.011695906432748537,
  "doc_count" : 171
},
{
  "@timestamp" : 1483361920000,
  "ratio" : 0.08695652173913043,
  "doc_count" : 23
},
...
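As a quick sanity check on the preview, multiplying each ratio by its doc_count should give back a whole number of 404s. The sketch below does that for the five buckets shown above; the function name implied_404s is hypothetical, and it assumes doc_count equals the total_count used as the denominator, which holds here because value_count on _index counts every document in the bucket.

```python
# The five preview buckets shown above, copied as Python dicts.
preview = [
    {"@timestamp": 1483347581000, "ratio": 0.0, "doc_count": 31},
    {"@timestamp": 1483351122000, "ratio": 0.0, "doc_count": 21},
    {"@timestamp": 1483354720000, "ratio": 0.008130081300813009, "doc_count": 123},
    {"@timestamp": 1483358376000, "ratio": 0.011695906432748537, "doc_count": 171},
    {"@timestamp": 1483361920000, "ratio": 0.08695652173913043, "doc_count": 23},
]


def implied_404s(bucket: dict) -> int:
    """Recover the number of 404s in a bucket from its ratio and doc_count."""
    # round() absorbs floating-point error in the product.
    return round(bucket["ratio"] * bucket["doc_count"])


if __name__ == "__main__":
    for bucket in preview:
        print(bucket["@timestamp"], implied_404s(bucket))
```

If the product were far from an integer, that would suggest the ratio and doc_count are not describing the same set of documents.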
Now, when we run the job we will see results like:
This shows a case where the ratio of 404s to overall traffic is higher than normal (in this case, about 99% of the traffic). This is corroborated by looking at the raw data: