Watcher alert response doesn't make sense

Hello, I have created an advanced Watcher alert that counts the HTTP 5xx status codes in my logs over the last minute.

Here is my alert:

{
  "trigger": {
    "schedule": {
      "cron": "0 * 2-20 ? * * *"
    }
  },
  "input": {
    "search": {
      "request": {
        "search_type": "query_then_fetch",
        "indices": [
          "apm-*"
        ],
        "rest_total_hits_as_int": true,
        "body": {
          "size": 0,
          "query": {
            "bool": {
              "must_not": {
                "term": {
                  "transaction.name": "TokenEndpoint#postAccessToken"
                }
              },
              "must": [
                {
                  "terms": {
                    "host.hostname": [
                      "sag-prd-cas-025.sag.services",
                      "sag-prd-cas-026.sag.services",
                      "sag-prd-cas-027.sag.services",
                      "sag-prd-cas-028.sag.services",
                      "sag-prd-cas-029.sag.services",
                      "sag-prd-cas-030.sag.services"
                    ]
                  }
                }
              ],
              "filter": [
                {
                  "range": {
                    "@timestamp": {
                      "gte": "now-1m"
                    }
                  }
                },
                {
                  "range": {
                    "http.response.status_code": {
                      "gte": 500,
                      "lte": 600
                    }
                  }
                }
              ]
            }
          },
          "aggs": {
            "hosts": {
              "terms": {
                "field": "host.hostname"
              },
              "aggs": {
                "transactions": {
                  "terms": {
                    "field": "transaction.name"
                  },
                  "aggs": {
                    "status": {
                      "terms": {
                        "field": "http.response.status_code"
                      }
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  },
  "condition": {
    "compare": {
      "ctx.payload.hits.total": {
        "gt": 30
      }
    }
  },
  "actions": {
    "send_email": {
      "email": {
        "profile": "standard",
        "to": [
          "alexandros.ananikidis@sag-ag.ch,Manuel.Fischer@sag-ag.ch,D-GRPSAGInformatike-commerce@sag-ag.ch,Panayiotis.Stathis@sag-ag.ch,thi.nguyen@bbv.vn,franco.chiellino@umb.ch,markus.brenner@sag-ag.ch,matthias.rohrbach@sag-ag.ch"
        ],
        "subject": "[PROD] 5xx HTTP status code detected",
        "body": {
          "html": """<h3>The Watcher has reached {{ctx.payload.hits.total}} time(s) the (500-600) http error status codes threshold the last 1 minute.

 The detailed results are the following: 

</h3>
 
<h2>Hosts:</h2>

 {{#ctx.payload.aggregations.hosts.buckets}}<h4>Host=({{key}})--------------------------</h4>
     <b>&ensp;-Transactions per Host:</b>
   <ul>
   {{#transactions.buckets}}<b>Transaction</b>={{key}}
     <dl>
      {{#status.buckets}}
         <li>&ensp;  <b>- Status</b> "{{key}}" <b>- Count</b> {{doc_count}} </li>
      {{/status.buckets}}
     </dl>
   {{/transactions.buckets}}
   </ul>
 {{/ctx.payload.aggregations.hosts.buckets}}         """
        }
      }
    }
  }
}

The results when my alert runs don't make sense to me. For example, ctx.payload.hits.total has the value 40, but the bucket counts only add up to 19 (see the example result right below).

"result": {
    "execution_time": "2020-09-02T07:05:00.442Z",
    "execution_duration": 182,
    "input": {
      "type": "search",
      "status": "success",
      "payload": {
        "_shards": {
          "total": 56,
          "failed": 0,
          "successful": 56,
          "skipped": 0
        },
        "hits": {
          "hits": [],
          "total": 40,
          "max_score": null
        },
        "took": 29,
        "timed_out": false,
        "aggregations": {
          "hosts": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "doc_count": 40,
                "transactions": {
                  "doc_count_error_upper_bound": 0,
                  "sum_other_doc_count": 0,
                  "buckets": [
                    {
                      "doc_count": 19,
                      "key": "ArticleSearchController#searchArticlesByCateIdsAndVehIds",
                      "status": {
                        "doc_count_error_upper_bound": 0,
                        "sum_other_doc_count": 0,
                        "buckets": [
                          {
                            "doc_count": 19,
                            "key": 500
                          }
                        ]
                      }
                    }
                  ]
                },
                "key": "sag-prd-cas-025.sag.services"
              }
            ]
          }
        }
      },

Does anyone know what I might be doing wrong here?

Thank you

Hey,

This is just a blind guess, but is it possible that some documents don't have that field set and are therefore not part of the aggregation? The hostname field might be set, but transaction.name or the status code might not be.
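
One way to check this would be to count the matching documents that have no transaction.name at all. Below is a minimal sketch of a search body, assuming the same apm-* indices and filters as in the watch. The host list is trimmed to one entry for brevity, and since now-1m will not reproduce the 07:05 execution, an absolute time range may be needed instead:

```json
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "terms": { "host.hostname": ["sag-prd-cas-025.sag.services"] } },
        { "range": { "@timestamp": { "gte": "now-1m" } } },
        { "range": { "http.response.status_code": { "gte": 500, "lte": 600 } } }
      ],
      "must_not": [
        { "exists": { "field": "transaction.name" } }
      ]
    }
  }
}
```

In the sample output the host bucket holds all 40 documents, while the only transaction bucket holds 19 with sum_other_doc_count at 0. If this query returned 21 hits for the same window, that would account for the gap.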

--Alex

Hello Alex,

Extremely good observation. I believe that was the case.
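
For anyone hitting the same thing, one possible option (a sketch, not verified against this cluster) is the "missing" parameter of the terms aggregation. It puts documents without the field into a bucket of their own, so the per-transaction counts should add up to the host bucket's doc_count again. The label "(no transaction.name)" is only a placeholder chosen for illustration. The status code needs no such default, because the range filter in the query already requires that field to be present. The "transactions" part of the watch would then look roughly like this:

```json
{
  "transactions": {
    "terms": {
      "field": "transaction.name",
      "missing": "(no transaction.name)"
    },
    "aggs": {
      "status": {
        "terms": {
          "field": "http.response.status_code"
        }
      }
    }
  }
}
```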

Thanks a lot.

Best regards,
Alexandros
