Set up watcher for alerting high CPU usage by some process

alerting

(Oleksandr Novozhylov) #1

Hello!

I'm trying to create a Watcher alert that triggers when some process on a node has used over 95% of CPU (0.95 normalized) for the last hour.

Here is an example of my config:

{
  "trigger": {
    "schedule": {
      "interval": "10m"
    }
  },
  "input": {
    "search": {
      "request": {
        "search_type": "query_then_fetch",
        "indices": [
          "metricbeat*"
        ],
        "types": [],
        "body": {
          "size": 0,
          "query": {
            "bool": {
              "must": [
                {
                  "range": {
                    "system.process.cpu.total.norm.pct": {
                      "gte": 0.95
                    }
                  }
                },
                {
                  "range": {
                    "system.process.cpu.start_time": {
                      "gte": "now-1h"
                    }
                  }
                },
                {
                  "match": {
                    "environment": "test"
                  }
                }
              ]
            }
          }
        }
      }
    }
  },
  "condition": {
    "compare": {
      "ctx.payload.hits.total": {
        "gt": 0
      }
    }
  },
  "actions": {
    "send-to-slack": {
      "throttle_period_in_millis": 1800000,
      "webhook": {
        "scheme": "https",
        "host": "hooks.slack.com",
        "port": 443,
        "method": "post",
        "path": "{{ctx.metadata.onovozhylov-test}}",
        "params": {},
        "headers": {
          "Content-Type": "application/json"
        },
        "body": "{ \"text\": \" ==========\nTest parameters:\n\tthrottle_period_in_millis: 60000\n\tInterval: 1m\n\tcpu.total.norm.pct: 0.5\n\tcpu.start_time: now-1m\n\nThe watcher:*{{ctx.watch_id}}* in env:*{{ctx.metadata.env}}* found that the process *{{ctx.system.process.name}}* has been utilizing CPU over 95% for the past 1 hr on node:\n{{#ctx.payload.nodes}}\t{{.}}\n\n{{/ctx.payload.nodes}}\n\nThe runbook entry is here: *{{ctx.metadata.runbook}}* \"}"
      }
    }
  },
  "metadata": {
    "onovozhylov-test": "/services/T0U0CFMT4/BBK1A2AAH/MlHAF2QuPjGZV95dvO11111111",
    "env": "{{ grains.get('environment') }}",
    "runbook": "http://mytest.com"
  }
}

This Watcher doesn't fire when I add the system.process.cpu.start_time range to the query. Perhaps this isn't the right field for that...

Another issue is that I don't know how to get system.process.name into the message body.

Thanks in advance for any help!


(Alexander Reelsen) #2

Can you elaborate on what exactly does not work? What do you mean by 'set the metric'? What do you want to do with the start time? Should it be part of the query?

In order to access the process name of the first hit, you can access the hits array from the response like ctx.payload.hits.hits[0]._source.system.process.name. You probably want to add an aggregation on your query to collect all the process names instead of going through the hits though.
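For example, a terms aggregation along these lines would collect the offending process names in one bucket list (a sketch only; the aggregation name "by_process" is made up, and depending on your Metricbeat version the field may need to be the keyword-mapped variant of system.process.name):

```json
"body": {
  "size": 0,
  "query": {
    "range": {
      "system.process.cpu.total.norm.pct": { "gte": 0.95 }
    }
  },
  "aggs": {
    "by_process": {
      "terms": { "field": "system.process.name" }
    }
  }
}
```

The bucket keys are then reachable in the action body via {{#ctx.payload.aggregations.by_process.buckets}}{{key}} {{/ctx.payload.aggregations.by_process.buckets}}, instead of digging through individual hits.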

Also, there is a dedicated slack action that you could use instead.
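For reference, a minimal sketch of the dedicated slack action (the account name "monitoring" and channel "#alerts" are placeholders; with this action the webhook URL lives in the Elasticsearch configuration/keystore rather than inside the watch):

```json
"actions": {
  "notify-slack": {
    "throttle_period": "30m",
    "slack": {
      "account": "monitoring",
      "message": {
        "from": "watcher",
        "to": [ "#alerts" ],
        "text": "Watch *{{ctx.watch_id}}* fired in env *{{ctx.metadata.env}}*. Runbook: {{ctx.metadata.runbook}}"
      }
    }
  }
}
```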

Hope this helps!

--alex


(Oleksandr Novozhylov) #3

Thank you for your answer.

I used system.process.cpu.start_time in the query to alert on a process that has used over 95% of CPU for a particular period of time (e.g. "gte": "now-1h"). However, it didn't work for this purpose: no alerts were sent. So I'm not sure this field can be used for such a case.

My issue is that I can't find a CPU-specific field, or any other field, to track a process that uses over 95% of CPU for a particular period of time.

Thanks, I'll give that a try!


(Alexander Reelsen) #4

Hey,

in order to pinpoint what 'does not work' means, can we step away from the watch for a second and make sure the query itself behaves as expected?

Can you share your full query and the response? Finding out why no data is being returned is the first step here, I think.
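That is, run the watch's search input on its own, e.g. in Kibana Dev Tools (this is just the input lifted out of your watch; try removing the start_time clause to see whether that range is what filters everything out):

```json
GET metricbeat*/_search
{
  "size": 1,
  "query": {
    "bool": {
      "must": [
        { "range": { "system.process.cpu.total.norm.pct": { "gte": 0.95 } } },
        { "range": { "system.process.cpu.start_time": { "gte": "now-1h" } } },
        { "match": { "environment": "test" } }
      ]
    }
  }
}
```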

--Alex


(system) #5

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.