Simple CPU alert


#1

I have been trying to create a CPU alert in Watcher. I tried both the advanced option and the threshold option. I can get pretty close, but invariably something goes wrong, despite following many examples in these forums as well as the documentation at elastic.co.

My use case: I want to receive an alert when the CPU (system.cpu.total.norm.pct) is over XX% for X minutes. The alert should report the CPU level and the host it occurred on.

For the threshold alert, I followed this article: https://www.elastic.co/guide/en/kibana/current/watcher-create-threshold-alert.html
Unfortunately I keep getting empty values for {{ctx.payload.result}}. I noticed that in some examples the field is surrounded by ` (backticks). I tried with and without them. I also tried {{ctx.payload.*}}.

I also got very close with the advanced watch. Everything was fine until I realized it was combining all my hosts for the number-of-hits check. I tried to add a per_host aggregation, but that put such a load on Elasticsearch that queries elsewhere were timing out.

{
  "trigger": {
    "schedule": {
      "interval": "1m"
    }
  },
  "input": {
    "search": {
      "request": {
        "search_type": "query_then_fetch",
        "indices": [
          "metricbeat-*"
        ],
        "types": [],
        "body": {
          "query": {
            "bool": {
              "filter": [
                {
                  "range": {
                    "@timestamp": {
                      "gte": "now-{{ctx.metadata.window_period}}"
                    }
                  }
                },
                {
                  "range": {
                    "system.cpu.total.norm.pct": {
                      "gte": "{{ctx.metadata.threshold}}"
                    }
                  }
                }
              ],
              "must": {
                "exists": {
                  "field": "system.cpu.total.norm.pct"
                }
              }
            }
          },
          "aggs": {
            "per_host": {
              "terms": {
                "size": 10,
                "field": "beat.hostname"
              }
            }
          }
        }
      }
    }
  },
  "condition": {
    "compare": {
      "ctx.payload.hits.total": {
        "gte": "{{ctx.metadata.number_of_hits}}"
      }
    }
  },
  "actions": {
    "notify-slack": {
      "throttle_period_in_millis": 300000,
      "transform": {
        "script": {
          "source": "def df = new DecimalFormat('##.##'); ctx.payload.hits.hits.forEach(hit -> hit._source.system.cpu.total.norm.pct = df.format(hit._source.system.cpu.total.norm.pct * 100)) ; return ctx.payload",
          "lang": "painless"
        }
      },
      "slack": {
        "message": {
          "to": [
            "#me"
          ],
          "text": "The following hosts' cpu usage percent, averaged across cores, has exceeded {{ctx.metadata.threshold_percent}}% CPU.\n To trigger this alert, the server exceeded the threshold {{ctx.metadata.number_of_hits}} or more times in the last {{ctx.metadata.window_period}} with each violation listed. \n{{#ctx.payload.hits.hits}} \n{{_source.beat.hostname}} : {{_source.system.cpu.total.norm.pct}}% : at {{_source.@timestamp}} \n {{/ctx.payload.hits.hits}} \n Kibana Dashboard: https://myserver:5601/goto/7afe71083e0de70d4cfaedec7c628227 \nPlease review the CPU troubleshooting guide."
        }
      }
    }
  },
  "metadata": {
    "threshold_percent": "40",
    "window_period": "1m",
    "threshold": 0.4,
    "number_of_hits": 2
  }
}

(CJ Cenizal) #2

Unfortunately I keep getting empty values for the {{ctx.payload.result}}

I suspect that ctx.payload.result is a typo. Depending on the value you want, it's possible you're looking for {{ctx.payload.hits.total}} instead.

I also got very very close with the advanced watch. Everything was fine until I realized it was combining all my hosts for the number of hits check. I tried to add a per_host aggs, but that put such a load on elasticsearch that queries elsewhere were timing out.

Your watch looks good to me. I see what you mean by the per_host aggs. Let me find someone who has some experience tweaking query performance and I'll get back to you.

CJ


(CJ Cenizal) #3

Hey Mike, I spoke with an Elasticsearch engineer, and a performance problem caused by adding a terms aggregation like the one you're using would be very unusual. You could try increasing the interval of your watch, though 1m should be plenty of time for the query to execute. It seems more likely that your slowdown is coming from somewhere else, though I know how vague and unhelpful that is. Diagnosing performance problems on a forum is generally a challenge. :)

I think your approach is on the right track though. If I understand your goal correctly, you want to dynamically check the CPU usage per hostname, which would be something you can do in your condition. You would have to write a script condition which iterates over ctx.payload.aggregations.per_host.buckets, which will be an array of bucket objects containing the host name assigned to a key property. Each bucket object will also have a doc_count with the number of documents falling into the bucket. Here's where you want to compare each doc_count value to ctx.metadata.number_of_hits.
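A minimal sketch of such a script condition, assuming the per_host aggregation and the number_of_hits metadata value from the watch above. It has not been run against a live cluster, so treat it as a starting point rather than a drop-in replacement:

```json
{
  "condition": {
    "script": {
      "lang": "painless",
      "source": "for (def bucket : ctx.payload.aggregations.per_host.buckets) { if (bucket.doc_count >= ctx.metadata.number_of_hits) { return true; } } return false;"
    }
  }
}
```

The condition is met as soon as any single host bucket reaches the hit count, so hosts are no longer combined into one total.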

Actually, there's an array compare condition which will do all of this for you. The docs even have an example which looks pertinent to what you're trying to do.

Anyway I hope this helped. Let me know if you have any more questions.

Thanks,
CJ


#4

Thanks cjcenizal, I believe it is now working correctly.
I ended up going with the array compare condition. I didn't remove my previous advanced alert, and that one also started working, but its results list fewer violations. So I am confused about why it wasn't firing on Friday, but I suspect it may be related to the CPU load on the Elasticsearch server.

Here is what I changed my condition to in case it helps others:

  "condition": {
    "array_compare": {
      "ctx.payload.aggregations.per_host.buckets": {
        "path": "doc_count",
        "gte": {
          "value": "{{ctx.metadata.number_of_hits}}",
          "quantifier": "some"
        }
      }
    }
  }
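
With the condition evaluated per host, the Slack text can also be built from the buckets instead of the raw hits. This is an untested sketch that assumes the same per_host aggregation and metadata as the watch above; the message wording is only illustrative. Note that the Mustache loop lists every bucket the aggregation returned, including hosts below number_of_hits, so filtering those out would need a transform.

```json
{
  "actions": {
    "notify-slack": {
      "throttle_period_in_millis": 300000,
      "slack": {
        "message": {
          "to": [
            "#me"
          ],
          "text": "Hosts over {{ctx.metadata.threshold_percent}}% CPU in the last {{ctx.metadata.window_period}}:\n{{#ctx.payload.aggregations.per_host.buckets}}{{key}} : {{doc_count}} samples over threshold\n{{/ctx.payload.aggregations.per_host.buckets}}"
        }
      }
    }
  }
}
```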

I have also found that disabling and re-enabling both of these alerts, as well as the Metricbeat feed, had very little impact on Elasticsearch CPU usage. Something else blew up the CPU, and I am not sure what. I can no longer query the metricbeat index in Discover.


(CJ Cenizal) #5

Mike, I'm glad you got your watch working and I appreciate you sharing your condition. Would you mind creating a fresh thread to track your problem with querying metricbeat data in Discover? That will make it more discoverable (pun not intended) for others.

Thanks,
CJ