Advanced Watcher - If Failed % is greater than some defined threshold value

Hi All,
I need some help implementing the monitoring condition below as a Watcher alert.

  1. We have a domain named "XYZ" as the value of one of the term fields, "main.domain"
  2. Under this domain "XYZ", we see logs for several recipient domains (abc.com, def.com, ghi.com, etc.)
  3. There are 3 types of message status: Sent, Sending and Failed

At any given time, messages will be either in Sent, Sending or Failed status for the respective recipient domains of the main domain named "XYZ".

Monitoring condition: at any time, if "Failed" messages for any recipient domain (abc.com, def.com, ghi.com, etc.) exceed 5% of that domain's messages, an alert needs to be triggered.

Currently, we filter on the main domain named "XYZ" for the last 5 minutes and pull 2 sets of aggregations:
a. for all recipient domains along with all 3 message status types
b. for all recipient domains and only with Failed message status type

Now I need to figure out how to set up an alert for the example below.
Let's say that in the last 5 minutes, for the main domain named "XYZ" and the recipient domain "abc.com", failed messages exceed 5%. In that case an alert needs to be triggered.

Any suggestions or advice would be of great help. Thank you all in advance.

Hey,

it would help to see what you have already tried, in order to get an overview. To me this sounds as if you need to compare two values returned by an aggregation response within your domain. This can be done using a script condition.

The most important thing here is probably not writing the watch, but writing the correct query that returns all the data required to do the parsing in the condition. You should focus on that first, before putting anything in a watch.

Hope that helps as a start.

--Alex

Thank you Alex. @spinscale

I have been working on queries that would generate the desired results and haven't put anything into a watch yet. Below are the queries I have tried so far.

  1. Removed indent to save space here
    This gives me the total messages for the main domain, broken down by recipient domain and message type.
    If I use "value_count" instead of "terms" for the message type aggregation, I get the combined total of all 3 message types under each recipient domain.

{"aggs":{"RecipientDomain":{"terms":{"field":"recipientDomain.keyword"},"aggs":{"MessageTypes":{"terms":{"field":"messageType.keyword"}}}}},"size":0,"_source":{"excludes":[]},
"script_fields":{},"docvalue_fields":[{"field":"@timestamp","format":"date_time"}],"query":{"bool":{"must":[{"match_all":{}},{"match_phrase":{"sendingDomain":
{"query":"main.domain.com"}}},{"range":{"@timestamp":{"format":"strict_date_optional_time","gte":"2022-03-24T15:49:42.890Z","lte":"2022-03-24T15:55:42.891Z"}}}]}}}

My result for this is:
"aggregations" : {
"recipient_domain" : {
"doc_count_error_upper_bound" : 12,
"sum_other_doc_count" : 576,
"buckets" : [
{
"key" : "ABC.com",
"doc_count" : 3479,
"Delivered_Message" : {
"buckets" : {
"messageType:Delivered" : {
"doc_count" : 3420
}
}
},
"Transient_Message" : {
"buckets" : {
"messageType:Transient" : {
"doc_count" : 32
}
}
},
"Failed_Message" : {
"buckets" : {
"messageType:Failed" : {
"doc_count" : 27
}
}
}
},

How can I add a script condition to calculate the % of failures?

  2. Removed indent to save space here
    In this case, I again get the total messages under each domain and under each status.

{"size":0,"query":{"bool":{"must":[{"match_phrase":{"sendingDomain":{"query":"main.domain.com"}}},{"range":{"@timestamp":{"gte":"2022-03-25T19:44:09.034Z",
"lte":"2022-03-25T19:50:09.034Z"}}}],"must_not":[]}},"aggs":{"recipient_domain":{"terms":{"field":"recipientDomain.keyword","min_doc_count":20,"size":10,"order":{"_count":"desc"}},
"aggs":
{"Failed_Message":{"filters":{"filters":{"messageType:Failed":{"bool":{"must":[],"filter":[{"bool":{"should":[{"match":{"messageType":"Failed"}}],"minimum_should_match":1}}],"should":[],"must_not":[]}}}}},
"Delivered_Message":{"filters":{"filters":{"messageType:Delivered":{"bool":{"must":[],"filter":[{"bool":{"should":[{"match":{"messageType":"Delivered"}}],"minimum_should_match":1}}],"should":[],"must_not":[]}}}}},
"Transient_Message":{"filters":{"filters":{"messageType:Transient":{"bool":{"must":[],"filter":[{"bool":{"should":[{"match":{"messageType":"Transient"}}],"minimum_should_match":1}}],"should":[],"must_not":[]}}}}},

When I add the script below to this query to compute the percentage, I get the same results as above but nothing for the computation itself. That is, I get the total records for each domain and each status, but the computed percentage does not appear in the response at all. What am I missing here?
I test these in Kibana DevTools.

"ComputePercentage":{"bucket_script":{"buckets_path":{"TD":"Delivered_Message.doc_count","TT":"Transient_Message.doc_count","TF":"Failed_Message.doc_count"},
"script":"(params.TF / (params.TD + params.TT + params.TF)) * 100"}}}}}}

My result for this is:
"aggregations" : {
"RD" : {
"doc_count_error_upper_bound" : 15,
"sum_other_doc_count" : 882,
"buckets" : [
{
"key" : "ABC.com",
"doc_count" : 3651,
"MT" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Delivered",
"doc_count" : 3573
},
{
"key" : "Transient",
"doc_count" : 51
},
{
"key" : "Failed",
"doc_count" : 27
}
]
}
},

How can I add a script condition to calculate the % of failures?
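
One thing that may be worth checking in the ComputePercentage attempt above is the path syntax: buckets_path uses ">" to step into a sub-aggregation and the special "_count" path for a bucket's document count, rather than ".doc_count". With the keyed "filters" aggregations from the query, the path would also have to select the bucket by its key. The following is only a sketch, written as a Python dict so it can be printed as JSON. It assumes a single-bucket "filter" aggregation in place of "filters", and the names "failed" and "failed_pct" are made up for illustration. It also takes the domain bucket's own count as the total, which matches Delivered + Transient + Failed only if those are the only message types.

```python
import json


def failed_percentage_aggs(domain_field="recipientDomain.keyword",
                           type_field="messageType.keyword"):
    """Build an aggs section with a per-domain failed count and percentage."""
    return {
        "recipient_domain": {
            "terms": {"field": domain_field, "size": 10},
            "aggs": {
                # Single-bucket `filter` agg (hypothetical name "failed"),
                # so its count can be addressed as `failed>_count`.
                "failed": {"filter": {"term": {type_field: "Failed"}}},
                "failed_pct": {
                    "bucket_script": {
                        "buckets_path": {
                            # doc_count of the enclosing domain bucket
                            "total": "_count",
                            # doc_count of the `failed` filter agg
                            "failed": "failed>_count",
                        },
                        "script": "params.failed * 100.0 / params.total",
                    }
                },
            },
        }
    }


# Print the request body so it can be pasted into Kibana DevTools.
print(json.dumps({"size": 0, "aggs": failed_percentage_aggs()}, indent=2))
```

The query part (main domain and time range) is left out here and would be the same as in the queries above.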

Hey,

so basically you have to loop through ctx.payload.aggregations.RD.buckets, and for each element you first extract the doc_count of the MT bucket with key == Delivered (for the first element that is under ctx.payload.aggregations.RD.buckets[0].MT.buckets), and then do the same for Transient/Failed. Once you have the Failed count and the total, you can divide them.
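
The loop described above can be sketched as follows. This is plain Python over a dict shaped like the RD/MT response posted earlier, not Painless, so in a watch the same steps would have to be written in the script condition against ctx.payload.aggregations. The function name and the "def.com" counts are made up for illustration; the "ABC.com" counts are the ones from the posted result.

```python
THRESHOLD_PCT = 5.0  # threshold from the original question


def failing_domains(aggregations, threshold_pct=THRESHOLD_PCT):
    """Return (domain, failed_pct) pairs whose failure rate exceeds the threshold."""
    failing = []
    for domain_bucket in aggregations["RD"]["buckets"]:
        # Map each message type to its doc_count, e.g. {"Delivered": 3573, ...}.
        counts = {b["key"]: b["doc_count"] for b in domain_bucket["MT"]["buckets"]}
        total = sum(counts.values())
        if total == 0:
            continue  # nothing to divide by
        failed_pct = counts.get("Failed", 0) * 100.0 / total
        if failed_pct > threshold_pct:
            failing.append((domain_bucket["key"], failed_pct))
    return failing


sample = {
    "RD": {
        "buckets": [
            {"key": "ABC.com", "doc_count": 3651, "MT": {"buckets": [
                {"key": "Delivered", "doc_count": 3573},
                {"key": "Transient", "doc_count": 51},
                {"key": "Failed", "doc_count": 27},
            ]}},
            # Hypothetical second domain, added only to show a failing case.
            {"key": "def.com", "doc_count": 100, "MT": {"buckets": [
                {"key": "Delivered", "doc_count": 90},
                {"key": "Failed", "doc_count": 10},
            ]}},
        ]
    }
}

print(failing_domains(sample))  # [('def.com', 10.0)]
```

ABC.com has 27 failed out of 3651 messages, roughly 0.74%, so it stays below the 5% threshold and is not reported.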

I think you should find some help in the watcher examples, even though they are a bit dated. See examples/Alerting/Sample Watches in the elastic/examples repository on GitHub.

--Alex

Thank you for your prompt response Alex @spinscale
I was able to use the ctx.payload aggregations and set up the watcher alert too.
I now receive an alert when any domain crosses the threshold of 5% failures out of its overall messages.

However, if multiple domains fail at the same time, how can I show all failures within the same email alert, rather than sending one alert for each failed domain?

If your search query contains the data for all failures, then you would have to explicitly add a foreach part to your action to run it once for each failure. If you do not do that, you should end up with a single alert. Or do you have a separate watch for each customer?

This is just 1 watcher alert covering all recipient domains failing at a given time.
Right now we see multiple alerts triggered at the same time if more than 1 recipient domain is failing at a rate above 5%.
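
A rough sketch of the single-alert idea: collect every failing domain first, then build one message from the whole list instead of running the action once per domain. It is written in Python for illustration only. In a watch this would presumably be an action without foreach whose body is rendered from the collected list. The function name and the (domain, percentage) input format are assumptions, not something taken from the watch in this thread.

```python
def build_alert_body(failing, threshold_pct=5.0):
    """Format one message listing every failing domain; None when nothing fails."""
    if not failing:
        return None
    lines = [
        f"{len(failing)} recipient domain(s) above {threshold_pct:g}% failed messages:"
    ]
    # Worst offenders first, one line per domain.
    for domain, pct in sorted(failing, key=lambda item: item[1], reverse=True):
        lines.append(f"- {domain}: {pct:.2f}% failed")
    return "\n".join(lines)


# Hypothetical input: two domains over the threshold in the same 5-minute window.
print(build_alert_body([("abc.com", 7.5), ("def.com", 12.0)]))
```

With this shape, two domains failing in the same window produce one email with two lines rather than two emails.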
