How to make a single watcher monitor all the services

Hi,
I am new to watcher, sorry if it sounds silly.
Can you please help me set up a single watcher in ELK to monitor all of my microservices (I have 7)? Currently I can get alerting with one watcher per service.

Following is my current watcher JSON:

{
  "trigger": {
    "schedule": {
      "interval": "5m"
    }
  },
  "input": {
    "search": {
      "request": {
        "search_type": "query_then_fetch",
        "indices": [
          "metricbeat-*"
        ],
        "types": [],
        "body": {
          "query": {
            "bool": {
              "must": [
                {
                  "range": {
                    "@timestamp": {
                      "gte": "now-60s"
                    }
                  }
                },
                {
                  "match": {
                    "fields.service": "Service-1"
                  }
                }
              ]
            }
          },
          "aggs": {
            "metricAgg": {
              "avg": {
                "field": "system.cpu.user.pct",
              }
            }
          }
        }
      }
    }
  },
  "condition": {
    "script": {
      "source": "if (ctx.payload.aggregations.metricAgg.value > params.threshold) { return true; } return false;",
      "lang": "painless",
      "params": {
        "threshold": 0.4
      }
    }
  },
  "actions": {
    "notify-slack": {
      "throttle_period_in_millis": 3600000,
      "slack": {#slack-message}
    }
  }
}

I tried to apply * in place of the service name, but then it searches for a service literally named *.
If I remove the filter below,

{
  "match": {
    "fields.service": "Service-1"
  }
}

I get results from one service each time I simulate, in a sequential manner, i.e., simulation 1 gets the metric from Service-1, simulation 2 gets the metric from Service-2, and so on.

Kindly let me know if there is a possible solution for getting a single watcher to monitor all of my services.

Hey,

first, you can use markdown here. Formatting your watch JSON by marking it as a code snippet will make it endlessly more readable.

So, if I understood you correctly, you want one watch for monitoring all of your services. My suggestion here would be to change the query to search for errors, but to aggregate on the service name using a terms aggregation. This way you will see all services having an error listed as part of the aggregation response.

If this is not what you want, correct me where my interpretation failed.

Thanks!

--Alex

Hi @spinscale

Thank you for the prompt response,

You are correct about a single watcher for monitoring all services. But we don't want the query to search for errors.

We are trying to send a notification when the average CPU/memory usage goes beyond a threshold.

In the initial post you can see that the watcher aggregates the CPU usage for an individual service, one watcher per service. (Formatted as mentioned.)

We want the watcher to do the same, but for all the services instead of a specific one. Could you please help us with it?

P.S. I have also mentioned the brute-force attempts I have tried.

hey,

you still want this CPU usage on a per-microservice basis, I assume? This is the reason I said that aggregations might be interesting.

Your query would look like this (I'm assuming things here):

find all documents

  • last 5 minutes till now
  • cpu > threshold

optional: then aggregate on service.

condition: more than 0 hits

action: then an email listing the services above threshold

hope this makes sense.
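Put together, the steps above could be sketched as a watch like this. The field names are taken from your original watch; the 0.4 threshold is an assumption, the action is omitted, and on newer versions the condition may need `ctx.payload.hits.total.value` instead of `ctx.payload.hits.total`:

```json
{
  "trigger": {
    "schedule": {
      "interval": "5m"
    }
  },
  "input": {
    "search": {
      "request": {
        "indices": [ "metricbeat-*" ],
        "body": {
          "query": {
            "bool": {
              "filter": [
                { "range": { "@timestamp": { "gte": "now-5m" } } },
                { "range": { "system.cpu.user.pct": { "gt": 0.4 } } }
              ]
            }
          },
          "aggs": {
            "services": {
              "terms": { "field": "fields.service" }
            }
          }
        }
      }
    }
  },
  "condition": {
    "compare": {
      "ctx.payload.hits.total": { "gt": 0 }
    }
  }
}
```

The action could then read `ctx.payload.aggregations.services.buckets` to list the affected services in the notification.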

Thank you @spinscale, yes, I would need CPU per service.

I will try the watcher structure you suggested and post the updates here :slight_smile:

Hi @spinscale,

I tried to aggregate services with a terms aggregation, as well as get CPU utilization with a metric aggregation. But the watcher simulation still results in random services each time we run it.

Below is the updated watcher; could you please check and tell me where I am wrong?

{
 "trigger": {
   "schedule": {
     "interval": "5m"
   }
 },
 "input": {
   "search": {
     "request": {
       "search_type": "query_then_fetch",
       "indices": [
         "metricbeat-*"
       ],
       "types": [],
       "body": {
         "query": {
           "bool": {
             "must": [
               {
                 "range": {
                   "@timestamp": {
                     "gte": "now-60s"
                   }
                 }
               }
             ]
           }
         },
         "aggs": {
           "metricAgg": {
             "avg": {
               "field": "system.cpu.user.pct"
             }
           },
           "services": {
              "terms": {
                "field" : "services"
              }
           }
         }
       }
     }
   }
 },
 "condition": {
   "script": {
     "source": "if (ctx.payload.aggregations.metricAgg.value > params.threshold) { return true; } return false;",
     "lang": "painless",
     "params": {
       "threshold": 0.9
     }
   }
 },
 "actions": {
   "notify-slack": {
     "throttle_period_in_millis": 3600000,
     "slack": {
       "message": {message }
   }
 }
}

hm, I think you want to group on services first, then on the percentage.

If that's not what you are after, some more explanation would be helpful about what exactly you are referring to as 'random' here.
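In watch terms, that means nesting the avg aggregation inside the terms aggregation instead of having them side by side. A sketch, assuming the terms field should be `fields.service` (as in your first watch) rather than `services`, and that it is mapped as a keyword:

```json
"aggs": {
  "services": {
    "terms": {
      "field": "fields.service",
      "size": 10
    },
    "aggs": {
      "metricAgg": {
        "avg": { "field": "system.cpu.user.pct" }
      }
    }
  }
}
```

The condition would then walk the buckets and fire if any service is above the threshold:

```json
"condition": {
  "script": {
    "source": "for (bucket in ctx.payload.aggregations.services.buckets) { if (bucket.metricAgg.value > params.threshold) { return true; } } return false;",
    "lang": "painless",
    "params": { "threshold": 0.9 }
  }
}
```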

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.