Set up watcher for alerting high CPU usage by some process

alerting

(Oleksandr Novozhylov) #1

Hello!

I'm trying to create a Watcher alert that triggers when some process on a node has used over 95% of CPU (0.95 normalized) for the last hour.

Here is an example of my config:

{
  "trigger": {
    "schedule": {
      "interval": "10m"
    }
  },
  "input": {
    "search": {
      "request": {
        "search_type": "query_then_fetch",
        "indices": [
          "metricbeat*"
        ],
        "types": [],
        "body": {
          "size": 0,
          "query": {
            "bool": {
              "must": [
                {
                  "range": {
                    "system.process.cpu.total.norm.pct": {
                      "gte": 0.95
                    }
                  }
                },
                {
                  "range": {
                    "system.process.cpu.start_time": {
                      "gte": "now-1h"
                    }
                  }
                },
                {
                  "match": {
                    "environment": "test"
                  }
                }
              ]
            }
          }
        }
      }
    }
  },
  "condition": {
    "compare": {
      "ctx.payload.hits.total": {
        "gt": 0
      }
    }
  },
  "actions": {
    "send-to-slack": {
      "throttle_period_in_millis": 1800000,
      "webhook": {
        "scheme": "https",
        "host": "hooks.slack.com",
        "port": 443,
        "method": "post",
        "path": "{{ctx.metadata.onovozhylov-test}}",
        "params": {},
        "headers": {
          "Content-Type": "application/json"
        },
        "body": "{ \"text\": \" ==========\nTest parameters:\n\tthrottle_period_in_millis: 60000\n\tInterval: 1m\n\tcpu.total.norm.pct: 0.5\n\tcpu.start_time: now-1m\n\nThe watcher:*{{ctx.watch_id}}* in env:*{{ctx.metadata.env}}* found that the process *{{ctx.system.process.name}}* has been utilizing CPU over 95% for the past 1 hr on node:\n{{#ctx.payload.nodes}}\t{{.}}\n\n{{/ctx.payload.nodes}}\n\nThe runbook entry is here: *{{ctx.metadata.runbook}}* \"}"
      }
    }
  },
  "metadata": {
    "onovozhylov-test": "/services/T0U0CFMT4/BBK1A2AAH/MlHAF2QuPjGZV95dvO11111111",
    "env": "{{ grains.get('environment') }}",
    "runbook": "http://mytest.com"
  }
}

This Watcher doesn't fire when I add the system.process.cpu.start_time range to the query. Perhaps this isn't the right field for that...

Another issue is that I don't know how to get system.process.name into the message body.

Thanks in advance for any help!


(Alexander Reelsen) #2

Can you elaborate on what exactly does not work? What do you mean by 'set the metric'? What do you want to do with the start time? Should it be part of the query?

In order to access the process name of the first hit, you can access the hits array from the response like ctx.payload.hits.hits[0]._source.system.process.name. You probably want to add an aggregation on your query to collect all the process names instead of going through the hits though.
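For example, a terms aggregation along these lines would collect the offending process names in one bucket list (a sketch only; the aggregation name "by_process" is made up, and depending on your Metricbeat version the field may need to be the keyword-mapped variant of system.process.name):

```json
"body": {
  "size": 0,
  "query": {
    "range": {
      "system.process.cpu.total.norm.pct": { "gte": 0.95 }
    }
  },
  "aggs": {
    "by_process": {
      "terms": { "field": "system.process.name" }
    }
  }
}
```

The bucket keys are then reachable in the action body via {{#ctx.payload.aggregations.by_process.buckets}}{{key}} {{/ctx.payload.aggregations.by_process.buckets}}, instead of digging through individual hits.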

Also, there is a dedicated slack action that you could use instead.
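For reference, a minimal sketch of the dedicated slack action (the account name "monitoring" and channel "#alerts" are placeholders; with this action the webhook URL lives in the Elasticsearch configuration/keystore rather than inside the watch):

```json
"actions": {
  "notify-slack": {
    "throttle_period": "30m",
    "slack": {
      "account": "monitoring",
      "message": {
        "from": "watcher",
        "to": [ "#alerts" ],
        "text": "Watch *{{ctx.watch_id}}* fired in env *{{ctx.metadata.env}}*. Runbook: {{ctx.metadata.runbook}}"
      }
    }
  }
}
```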

Hope this helps!

--alex


(Oleksandr Novozhylov) #3

Thank you for your answer.

I used system.process.cpu.start_time in the query to alert on a process that has used over 95% of CPU for a particular period of time (e.g. "gte": "now-1h"). However, it didn't work for this purpose: no alerts were sent. So I'm not sure this field can be used for such a case.

My issue is that I can't find a CPU-specific field, or any other field, to track a process that uses over 95% of CPU for a particular period of time.

Thanks, I'll give that a try!


(Alexander Reelsen) #4

Hey,

in order to pinpoint what 'does not work' means, can we step away from the watch for a second and make sure the query itself behaves as expected?

Can you share your full query and the response? Finding out why no data is being returned is the first step here, I think.
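That is, run the watch's search input on its own, e.g. in Kibana Dev Tools (this is just the input lifted out of your watch; try removing the start_time clause to see whether that range is what filters everything out):

```json
GET metricbeat*/_search
{
  "size": 1,
  "query": {
    "bool": {
      "must": [
        { "range": { "system.process.cpu.total.norm.pct": { "gte": 0.95 } } },
        { "range": { "system.process.cpu.start_time": { "gte": "now-1h" } } },
        { "match": { "environment": "test" } }
      ]
    }
  }
}
```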

--Alex


(system) #5

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.