Analyzing a ratio of documents over time with Anomaly Detection

Although this idea has been around for a while, I've decided to document the technique because it recently came up in a discussion.

Problem statement: How to analyze the ratio of documents over time with Anomaly Detection? For example, analyze the ratio of 404s to overall traffic volume in web access logs over time.

Solution: First and foremost, one needs to understand a few concepts:

  1. Elasticsearch queries can include aggregations, which are on-the-fly summarizations of the data. For this example, the relevant aggregation type is the bucket_script aggregation: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-pipeline-bucket-script-aggregation.html

  2. ML jobs can use aggregations as input. See the following docs: https://www.elastic.co/guide/en/machine-learning/7.6/ml-configuring-aggregation.html

For this specific example, the following would be done:

First, create the job definition (note: in all examples, replace _xpack/ml with _ml in versions 7.x and later):

PUT _xpack/ml/anomaly_detectors/web_ratio
{
  "description": "Ratio of 404s to Total traffic",
  "analysis_config": {
    "bucket_span": "1h",
    "detectors": [
      {
        "detector_description": "mean(ratio)",
        "function": "mean",
        "field_name": "ratio"
      }
    ],
    "summary_count_field_name": "doc_count"
  },
  "model_plot_config": {
    "enabled": true
  },
  "data_description": {
    "time_field": "@timestamp"
  }
}
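
The version note above can be captured in a small helper when the requests are scripted rather than typed into the console. This is only a sketch in Python: ml_path and its major_version argument are hypothetical names for illustration, not part of any Elastic client, and the version cutoff is taken directly from the note above.

```python
def ml_path(major_version: int, *parts: str) -> str:
    """Build an ML API path using the prefix for the given Elasticsearch major version."""
    # Assumption (from the note above): "_ml" for 7.x and later, "_xpack/ml" before that.
    prefix = "_ml" if major_version >= 7 else "_xpack/ml"
    return "/".join([prefix, *parts])


# Paths for the job used in this post, on a 6.x and a 7.x cluster respectively.
print(ml_path(6, "anomaly_detectors", "web_ratio"))
print(ml_path(7, "anomaly_detectors", "web_ratio"))
```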

Next, define the datafeed:

PUT _xpack/ml/datafeeds/datafeed-web_ratio/
{
  "job_id": "web_ratio",
  "indices": [
    "gallery-*"
  ],
  "aggregations": {
    "buckets": {
      "date_histogram": {
        "field": "@timestamp",
        "interval": "1h",
        "time_zone": "UTC"
      },
      "aggregations": {
        "@timestamp": {
          "max": {
            "field": "@timestamp"
          }
        },
        "total_count": {
          "value_count": {
            "field": "_index"
          }
        },
        "404s": {
          "filter": {
            "term": {
              "status": "404"
            }
          }
        },
        "ratio": {
          "bucket_script": {
            "buckets_path": {
              "total_count": "total_count.value",
              "four_oh_fours": "404s._count"
            },
            "script": "if(params.total_count>0){params.four_oh_fours / params.total_count} else{0}"
          }
        }
      }
    }
  }
}

Notice that the script calculates the ratio but guards against a divide-by-zero condition: if a bucket contains no documents, the ratio is set to 0.
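
To make the guard easier to reason about outside of Elasticsearch, here is the same logic as a minimal Python sketch. The function name safe_ratio is made up for illustration; the argument names follow the buckets_path keys in the datafeed.

```python
def safe_ratio(four_oh_fours: int, total_count: int) -> float:
    """Mirror the bucket_script: 404 count over total count, or 0 for an empty bucket."""
    if total_count > 0:
        return four_oh_fours / total_count
    # An empty bucket would otherwise divide by zero, so report a ratio of 0.
    return 0.0


print(safe_ratio(1, 123))  # one 404 out of 123 requests
print(safe_ratio(0, 0))    # empty bucket, falls through to 0.0
```

One design consequence worth keeping in mind: an empty bucket and a bucket with traffic but no 404s both produce a ratio of 0, so the two cases look identical to the job.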

We can test the datafeed with the _preview endpoint:

GET _xpack/ml/datafeeds/datafeed-web_ratio/_preview

which will yield something like:

  {
    "@timestamp" : 1483347581000,
    "ratio" : 0.0,
    "doc_count" : 31
  },
  {
    "@timestamp" : 1483351122000,
    "ratio" : 0.0,
    "doc_count" : 21
  },
  {
    "@timestamp" : 1483354720000,
    "ratio" : 0.008130081300813009,
    "doc_count" : 123
  },
  {
    "@timestamp" : 1483358376000,
    "ratio" : 0.011695906432748537,
    "doc_count" : 171
  },
  {
    "@timestamp" : 1483361920000,
    "ratio" : 0.08695652173913043,
    "doc_count" : 23
  },
...
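
A quick way to sanity-check preview output like this is to convert the timestamps and back out the implied number of 404s per bucket. The sketch below is in Python, with the first three records copied from the output above. The helper name describe is hypothetical, and it assumes doc_count equals the total_count used in the script, which should hold here because value_count on _index counts every document in the bucket.

```python
from datetime import datetime, timezone

# First three records copied from the preview output above.
preview = [
    {"@timestamp": 1483347581000, "ratio": 0.0, "doc_count": 31},
    {"@timestamp": 1483351122000, "ratio": 0.0, "doc_count": 21},
    {"@timestamp": 1483354720000, "ratio": 0.008130081300813009, "doc_count": 123},
]


def describe(record: dict) -> tuple:
    """Return (UTC time, doc_count, implied number of 404s) for one preview record."""
    # "@timestamp" is in epoch milliseconds (the max timestamp seen in the bucket).
    when = datetime.fromtimestamp(record["@timestamp"] / 1000, tz=timezone.utc)
    # Assumption: doc_count equals total_count, so ratio * doc_count recovers the 404 count.
    implied_404s = round(record["ratio"] * record["doc_count"])
    return when, record["doc_count"], implied_404s


for rec in preview:
    when, total, errors = describe(rec)
    print(when.isoformat(), total, errors)
```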

Now, when we run the job, the results show a case where the ratio of 404s to the overall traffic volume is higher than normal (in this case, about 99% of the traffic). This is corroborated by looking at the raw data.
