How to search for anomalies in metrics?

I want to create a small script that will query ES for data and then send me an alert when it sees "anomalous" data.

For example, if the CPU load on a VM suddenly spikes, I'd like an alert. That's easy enough to do if I just set a threshold, but I'd rather check for a sudden change in load/usage, since some VMs will naturally have a high CPU load or RAM usage while others will not.

I am digging into Elasticsearch Query DSL and the various aggregations to try and create my own script for this. Basically, run a query, check for a condition, and then send an alert, or not.
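
The rough shape I'm picturing is something like this (just an untested sketch; the URL, index pattern, and alert hook are placeholders, and looks_anomalous() is exactly the part I haven't figured out yet):

# Rough skeleton of the "run a query, check a condition, alert" flow.
import requests

ES_URL = "http://localhost:9200"   # placeholder
INDEX = "metricbeat-*"             # placeholder index pattern

def looks_anomalous(value):
    # Placeholder condition -- this is the part I'm asking about.
    return value is not None and value > 4.0

def send_alert(message):
    # Placeholder: could be an email, a webhook, whatever.
    print("ALERT:", message)

body = {
    "size": 0,
    "query": {"range": {"@timestamp": {"gte": "now-30m", "lte": "now"}}},
    "aggs": {"avg_load_1": {"avg": {"field": "system.load.1"}}},
}

resp = requests.post(f"{ES_URL}/{INDEX}/_search", json=body, timeout=30)
resp.raise_for_status()
avg_load = resp.json()["aggregations"]["avg_load_1"]["value"]

if looks_anomalous(avg_load):
    send_alert(f"avg system.load.1 over the last 30m is {avg_load:.2f}")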

All my search results on this topic end up at proprietary solutions or Elastalert. I have no budget for this, and my attempts at getting Elastalert working were not successful, though I may revisit it once I understand how to search ES better.

Here are a few specific things I want to watch for:

A) If CPU load has gone up by more than 200% in the past 30 minutes.
B) If RAM usage has gone up by more than 200% in the past 30 minutes.
C) If the number of Apache requests has suddenly gone down by more than 50% in the past 60 minutes.

How would you go about watching for those things?
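
For A), one idea I had was to pull both windows in a single search with two filter sub-aggregations and compare the averages in the script, something like this (again an untested sketch; the index pattern is a placeholder, and I'm reading "gone up by more than 200%" as "current average is more than 3x the previous one"):

# Sketch for A): compare average CPU load over the last 30 minutes against the
# 30 minutes before that, using two filter sub-aggregations in one search.
import requests

ES_URL = "http://localhost:9200"   # placeholder
INDEX = "metricbeat-*"             # placeholder index pattern

body = {
    "size": 0,
    "query": {
        "bool": {
            "filter": [
                {"match_phrase": {"host.name": "learnescentos7"}},
                {"match_phrase": {"metricset.name": "load"}},
                {"range": {"@timestamp": {"gte": "now-60m", "lte": "now"}}},
            ]
        }
    },
    "aggs": {
        "previous_30m": {
            "filter": {"range": {"@timestamp": {"gte": "now-60m", "lt": "now-30m"}}},
            "aggs": {"avg_load_1": {"avg": {"field": "system.load.1"}}},
        },
        "current_30m": {
            "filter": {"range": {"@timestamp": {"gte": "now-30m", "lte": "now"}}},
            "aggs": {"avg_load_1": {"avg": {"field": "system.load.1"}}},
        },
    },
}

resp = requests.post(f"{ES_URL}/{INDEX}/_search", json=body, timeout=30)
resp.raise_for_status()
aggs = resp.json()["aggregations"]
previous = aggs["previous_30m"]["avg_load_1"]["value"]
current = aggs["current_30m"]["avg_load_1"]["value"]

# "Gone up by more than 200%" read as: current average is more than 3x the previous one.
if previous and current and current > 3 * previous:
    print(f"ALERT: avg system.load.1 went from {previous:.2f} to {current:.2f}")

I'm guessing B) would be the same idea with the RAM usage field swapped in, and C) the same idea with 60-minute windows and the comparison flipped, but I don't know if this is a reasonable approach in the first place.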

In my research, I discovered the Median Absolute Deviation aggregation. Would watching that be a way to get close to what I'm after?

I was able to build this query:

GET /_search
{
 "query": {
    "bool": {
      "must": [
        {
          "match_all": {}
        }
      ],
      "filter": [
        {
          "match_phrase": {
            "host.name": "learnescentos7"
          }
        },
        {
          "match_phrase": {
            "agent.type": "metricbeat"
          }
        },
        {
          "match_phrase": {
            "metricset.name": "load"
          }
        },
        {
          "range": {
            "@timestamp": {
              "gte": "now-30m",
              "lte": "now",
              "time_zone": "America/Los_Angeles"
            }
          }
        }
      ],
      "should": [],
      "must_not": []
    }
  },
  "aggs": {
    "avg_load_1": { "avg": { "field": "system.load.1" }},
    "max_load_1": { "max": { "field": "system.load.1" }},
    "min_load_1": { "min": { "field": "system.load.1" }},
    "variability_1": { "median_absolute_deviation": { "field": "system.load.1" }},
    "avg_load_5": { "avg": { "field": "system.load.5" }},
    "max_load_5": { "max": { "field": "system.load.5" }},
    "min_load_5": { "min": { "field": "system.load.5" }},
    "variability_5": { "median_absolute_deviation": { "field": "system.load.5" }},
    "avg_load_15": { "avg": { "field": "system.load.15" }},
    "max_load_15": { "max": { "field": "system.load.15" }},
    "min_load_15": { "min": { "field": "system.load.15" }},
    "variability_15": { "median_absolute_deviation": { "field": "system.load.15" }}
  }
}

Would alerting on variability_1 being larger than 0 get me anywhere close to knowing whether CPU load has gone up? I think I might be off track here, since as far as I can tell that aggregation only measures how spread out the values in the window are, so it would react to load going down just as much as load going up.

Any advice would be welcome. 🙂 Thanks!
