Simple CPU alert


#1

I have been trying to create a CPU alert in Watcher. I tried both the advanced watch and the threshold alert. I can get pretty close, but invariably something goes wrong, despite following many examples in these forums as well as the documentation at elastic.co.

My use case is that I want to receive an alert when the CPU (system.cpu.total.norm.pct) is over XX% for X number of minutes. The alert should report the CPU level and the host it occurred on.

For the threshold alert, I followed this article: https://www.elastic.co/guide/en/kibana/current/watcher-create-threshold-alert.html
Unfortunately I keep getting empty values for {{ctx.payload.result}}. I noticed that in some examples the field is surrounded by ` (backticks). I tried with and without them. I also tried {{ctx.payload.*}}.

I also got very close with the advanced watch. Everything was fine until I realized it was combining all my hosts for the number-of-hits check. I tried to add a per_host aggregation, but that put such a load on Elasticsearch that queries elsewhere were timing out.

{
  "trigger": {
    "schedule": {
      "interval": "1m"
    }
  },
  "input": {
    "search": {
      "request": {
        "search_type": "query_then_fetch",
        "indices": [
          "metricbeat-*"
        ],
        "types": [],
        "body": {
          "query": {
            "bool": {
              "filter": [
                {
                  "range": {
                    "@timestamp": {
                      "gte": "now-{{ctx.metadata.window_period}}"
                    }
                  }
                },
                {
                  "range": {
                    "system.cpu.total.norm.pct": {
                      "gte": "{{ctx.metadata.threshold}}"
                    }
                  }
                }
              ],
              "must": {
                "exists": {
                  "field": "system.cpu.total.norm.pct"
                }
              }
            }
          },
          "aggs": {
            "per_host": {
              "terms": {
                "size": 10,
                "field": "beat.hostname"
              }
            }
          }
        }
      }
    }
  },
  "condition": {
    "compare": {
      "ctx.payload.hits.total": {
        "gte": "{{ctx.metadata.number_of_hits}}"
      }
    }
  },
  "actions": {
    "notify-slack": {
      "throttle_period_in_millis": 300000,
      "transform": {
        "script": {
          "source": "def df = new DecimalFormat('##.##'); ctx.payload.hits.hits.forEach(hit -> hit._source.system.cpu.total.norm.pct = df.format(hit._source.system.cpu.total.norm.pct * 100)) ; return ctx.payload",
          "lang": "painless"
        }
      },
      "slack": {
        "message": {
          "to": [
            "#me"
          ],
          "text": "The following hosts' cpu usage percent, averaged across cores, has exceeded {{ctx.metadata.threshold_percent}}% CPU.\n To trigger this alert, the server exceeded the threshold {{ctx.metadata.number_of_hits}} or more times in the last {{ctx.metadata.window_period}}, with each violation listed. \n{{#ctx.payload.hits.hits}} \n{{_source.beat.hostname}} : {{_source.system.cpu.total.norm.pct}}% : at {{_source.@timestamp}} \n {{/ctx.payload.hits.hits}} \n Kibana Dashboard: https://myserver:5601/goto/7afe71083e0de70d4cfaedec7c628227 \nPlease review the CPU troubleshooting guide."
        }
      }
    }
  },
  "metadata": {
    "threshold_percent": "40",
    "window_period": "1m",
    "threshold": 0.4,
    "number_of_hits": 2
  }
}
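
To illustrate why a check on ctx.payload.hits.total combines all hosts, here is a minimal Python sketch (not Watcher code). The sample hits and the helper names total_hits and hits_per_host are made up for illustration; the threshold of 2 mirrors number_of_hits in the metadata above.

```python
from collections import Counter

# Hypothetical search hits: each one is a document that exceeded the CPU threshold.
hits = [
    {"host": "web-1", "cpu": 0.45},
    {"host": "web-2", "cpu": 0.52},
]
NUMBER_OF_HITS = 2  # mirrors metadata.number_of_hits in the watch


def total_hits(hits):
    """What a compare on hits.total sees: one count across all hosts."""
    return len(hits)


def hits_per_host(hits):
    """What a per-host terms aggregation would report: one count per host."""
    return Counter(hit["host"] for hit in hits)


# The global count reaches the threshold even though no single host does.
fires_globally = total_hits(hits) >= NUMBER_OF_HITS
fires_per_host = any(n >= NUMBER_OF_HITS for n in hits_per_host(hits).values())
print(fires_globally, fires_per_host)
```

With one violation on each of two hosts, the global check fires while the per-host check does not, which is the behaviour described above.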

(CJ Cenizal) #2

Unfortunately I keep getting empty values for the {{ctx.payload.result}}

I suspect that ctx.payload.result is a typo. Depending on the value you want, it's possible you're looking for {{ctx.payload.hits.total}} instead.

I also got very very close with the advanced watch. Everything was fine until I realized it was combining all my hosts for the number of hits check. I tried to add a per_host aggs, but that put such a load on elasticsearch that queries elsewhere were timing out.

Your watch looks good to me. I see what you mean by the per_host aggs. Let me find someone who has some experience tweaking query performance and I'll get back to you.

CJ


(CJ Cenizal) #3

Hey Mike, I spoke with an Elasticsearch engineer and a performance issue in response to the addition of a terms aggregation like the one you're using would be very unexpected/unusual. You could try increasing the interval of your watch, though 1m should be plenty of time for a query to execute. It seems more likely that your slowdown is coming from somewhere else, though I know how vague and unhelpful that is -- diagnosing performance problems on a forum is generally a challenge. :)

I think your approach is on the right track though. If I understand your goal correctly, you want to dynamically check the CPU usage per hostname, which would be something you can do in your condition. You would have to write a script condition which iterates over ctx.payload.aggregations.per_host.buckets, which will be an array of bucket objects containing the host name assigned to a key property. Each bucket object will also have a doc_count with the number of documents falling into the bucket. Here's where you want to compare each doc_count value to ctx.metadata.number_of_hits.

Actually, there's an array compare condition which will do all of this for you. The docs even have an example which looks pertinent to what you're trying to do.
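
The per-bucket check described above can be sketched in Python as a rough model of the logic, not of Watcher's implementation. The function name array_compare_gte and the sample buckets are hypothetical; the bucket shape (key, doc_count) follows the description of ctx.payload.aggregations.per_host.buckets.

```python
# Hypothetical buckets, shaped like a terms aggregation result.
buckets = [
    {"key": "web-1", "doc_count": 1},
    {"key": "web-2", "doc_count": 3},
]


def array_compare_gte(buckets, path, value, quantifier="some"):
    """Test bucket[path] >= value for some or all of the buckets.

    Empty-list behaviour follows Python's any/all and has not been
    checked against what Watcher does.
    """
    results = [bucket[path] >= value for bucket in buckets]
    if quantifier == "some":
        return any(results)
    if quantifier == "all":
        return all(results)
    raise ValueError(f"unknown quantifier: {quantifier}")


# Fires if at least one host reached the threshold of 2 documents.
print(array_compare_gte(buckets, "doc_count", 2, "some"))
# Would require every host to reach the threshold.
print(array_compare_gte(buckets, "doc_count", 2, "all"))
```

For a "did any host exceed the limit" alert, the "some" form is the one that matches the use case.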

Anyway I hope this helped. Let me know if you have any more questions.

Thanks,
CJ


#4

Thanks cjcenizal, I believe it is now working correctly.
I ended up going with the array compare. I didn't remove my previous advanced alert, and that one also started working, but its results have fewer violations listed. So I am confused why it wasn't firing on Friday, but suspect it may be related to the CPU of the Elasticsearch server.

Here is what I changed my condition to in case it helps others:

  "condition": {
    "array_compare": {
      "ctx.payload.aggregations.per_host.buckets": {
        "path": "doc_count",
        "gte": {
          "value": "{{ctx.metadata.number_of_hits}}",
          "quantifier": "some"
        }
      }
    }
  }

I have also found that disabling/enabling both these alerts as well as the Metricbeat feed had very little impact on Elasticsearch CPU usage. Something else blew up the CPU, and I am not sure what. I can no longer query the metricbeat index in Discover.


(CJ Cenizal) #5

Mike, I'm glad you got your watch working and I appreciate you sharing your condition. Would you mind creating a fresh thread to track your problem with querying metricbeat data in Discover? That will make it more discoverable (pun not intended) for others.

Thanks,
CJ


#6

Hi cjcenizal, I am still having difficulties with this alert. My per host advanced alert has a bug where it is reporting servers which failed only one CPU check along with the servers which failed multiple times (my threshold is 2 failures).
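
One possible explanation, offered as an assumption rather than a diagnosis: if the condition checks the per-host buckets but the action still loops over every hit, hosts with a single violation get listed too. The Python sketch below shows the filtering step that would be needed before reporting; the names violating_hosts and hits_to_report and the sample data are hypothetical, and in a watch this logic would have to live in a transform.

```python
# Hypothetical data: raw hits plus the per-host bucket counts.
hits = [
    {"host": "web-1", "cpu": 0.45},
    {"host": "web-2", "cpu": 0.50},
    {"host": "web-2", "cpu": 0.60},
]
buckets = [
    {"key": "web-1", "doc_count": 1},
    {"key": "web-2", "doc_count": 2},
]


def violating_hosts(buckets, number_of_hits):
    """Hosts whose bucket count reached the threshold."""
    return [b["key"] for b in buckets if b["doc_count"] >= number_of_hits]


def hits_to_report(hits, buckets, number_of_hits):
    """Keep only hits that belong to a host over the threshold."""
    hosts = set(violating_hosts(buckets, number_of_hits))
    return [hit for hit in hits if hit["host"] in hosts]


reported = hits_to_report(hits, buckets, 2)
for hit in reported:
    print(hit["host"], hit["cpu"])
```

Here web-1 failed only once, so it is dropped from the report even though its hit is in the search results.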

After discussing with my team, we really want to get threshold alerts working. That seems like it will make it much easier to quickly set up future alerts as well as train other teams.

But I am completely lost on the Action part of the alert. To report interesting information on why an alert fired (for example CPU level and host name), we need to access fields in the payload. But none of this appears to be exposed in a discoverable way (for example, being able to look at the returned data under advanced). I have tried many different identifiers, including the suggested {{ctx.payload.hits.total}}, but it always results in empty values.

How do I know what results are returned so I can set up these alerts to access and report that data?

Other article I have been following: https://www.elastic.co/blog/creating-a-threshold-alert-in-elasticsearch-is-simpler-than-ever

But in that article, I don't see a group by host clause. So it is more of an example than anything which could be used in real life, right?

https://www.elastic.co/guide/en/kibana/current/watcher-create-threshold-alert.html
This has a section "System load:" which I tried to reproduce, but that doesn't include what the Action would look like.

Thank you very much for the help.


#7

I finally got this working with the threshold alert.

{{#ctx.payload.results}}
Server {{key}}, Decimal Percentage {{value}}

{{/ctx.payload.results}}

One final issue I have been having with this is the percentage formatting. Since the threshold alerts don't provide access to the transformation, I am unsure how to format these percentages. The following doesn't look great:

Server MyServer1, Decimal Percentage 0.011166666666666667

I looked over the mustache documentation and did not find a way to do this. Does anyone have advice on formatting? Ideally I would want it rounded and multiplied by 100.

Server MyServer1, CPU level 1%
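
Mustache is logic-less and has no arithmetic, so the multiply-and-round step has to happen before the template is rendered. As a sketch of that arithmetic only (Python, not something the threshold alert UI can run), with hypothetical helper names format_cpu and render and a results list shaped like the key/value pairs in the template above. Note that 0.011166666666666667 times 100 is about 1.1, so it rounds to 1%.

```python
# Hypothetical results, shaped like the key/value pairs the template iterates over.
results = [
    {"key": "MyServer1", "value": 0.011166666666666667},
    {"key": "MyServer2", "value": 0.4},
]


def format_cpu(value):
    """Convert a 0..1 fraction to a whole-number percentage string."""
    return f"{value * 100:.0f}%"


def render(results):
    """Build one line per server, mirroring the message template."""
    return "\n".join(
        f"Server {r['key']}, CPU level {format_cpu(r['value'])}" for r in results
    )


print(render(results))
```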
