How to set Watcher alerts on aggregations

(dina) #1

I would like to monitor my errors in Elasticsearch.

I would like to get a notification if a certain error occurs more than a certain number of times (let's say 2 times) within a period of one hour.

For example, if these are my error logs from the last hour:

{msg: "storage_failed", level: "error", name: "jim"}
{msg: "connection_closed", level: "error", name: "jack"}
{msg: "error_occurred", level: "error", name: "jay"}
{msg: "storage_failed", level: "error", name: "sam"}
{msg: "connection_closed", level: "error", name: "jack"}
{msg: "connection_closed", level: "error", name: "tom"}

I would get 2 email notifications:

1) error: connection_closed 3 times
2) error: storage_failed 3 times

Once I have received a notification for a certain error, notifications for that error should be muted for 1 hour (using throttle_period).

In the example above, notifications for storage_failed and connection_closed would be muted,
but if a different error occurs, a notification should still be sent.

Note: my error messages are dynamic; I do not know them in advance.

Here is what I tried:

curl -XPUT 'https://elastic-instance:9243/_xpack/watcher/watch/log_error_watch?pretty' -H 'Content-Type: application/json' -d'
{
  "trigger" : {"schedule" : { "interval" : "1m" }},
  "input" : {
    "search" : {
      "request" : {
        "indices" : [ "logs" ],
        "body" : {
          "query": {
            "bool": {
              "must": [
                { "match_phrase": { "level": "error" } },
                {"range" : {"timestamp" : {"gte": "now-1h", "lte": "now"}}}
              ]
            }
          },
          "aggs": {
            "error_msg": {
              "terms": {
                "field": "msg.keyword"
              }
            }
          }
        }
      }
    }
  },
  "condition" : {
    "compare" : { "ctx.payload.aggregations.error_msg.buckets.0.doc_count" : { "gt" : 2 }}
  },
  "actions" : {
    "email_administrator" : {
      "throttle_period": "2h",
      "email" : {
        "to" : "example@gmail.com",
        "subject" : "Encountered {{ctx.payload.aggregations.error_msg.buckets.0.doc_count}} errors",
        "body" : "Too many error in the system, see attached data",
        "attachments" : {
          "attached_data" : {
            "data" : {
              "format" : "json"
            }
          }
        },
        "priority" : "high"
      }
    }
  }
}
'

This is the notification I get:

{
  "ctx" : {
    "metadata" : null,
    "watch_id" : "log_error_watch",
    "payload" : {
      "_shards" : {
        "total" : 5,
        "failed" : 0,
        "successful" : 5
      },
      "hits" : {
        "hits" : [
          {
            "_index" : "logs",
            "_type" : "event",
            "_source" : {
              "request" : "GET index.html",
              "status_code" : 404,
              "level" : "error",
              "message" : "ppppp",
              "timestamp" : "2017-07-31T12:05:22.119Z"
            },
            "_id" : "AV2YicoWSIeOW7mgwgRM",
            "_score" : 1.0870113
          },
          {
            "_index" : "logs",
            "_type" : "event",
            "_source" : {
              "request" : "GET index.html",
              "status_code" : 404,
              "level" : "error",
              "message" : "ooooooooo",
              "timestamp" : "2017-07-31T12:05:22.119Z"
            },
            "_id" : "AV2YifZ1SIeOW7mgwgRR",
            "_score" : 1.0870113
          },
          ...
        ],
        "total" : 4,
        "max_score" : 1.1823215
      },
      "took" : 1,
      "timed_out" : false,
      "aggregations" : {
        "error_msg" : {
          "doc_count_error_upper_bound" : 0,
          "sum_other_doc_count" : 0,
          "buckets" : [
            {
              "doc_count" : 2,
              "key" : "ooooooooo"
            },
            {
              "doc_count" : 2,
              "key" : "ppppp"
            }
          ]
        }
      }
    },
    "id" : "log_error_watch_6fd76d9d-05bc-4e75-962e-26f86259b88f-2017-07-31T12:10:02.895Z",
    "trigger" : {
      "triggered_time" : "2017-07-31T12:10:02.895Z",
      "scheduled_time" : "2017-07-31T12:10:02.895Z"
    },
    "vars" : { },
    "execution_time" : "2017-07-31T12:10:02.895Z"
  }
}

Now, how do I iterate over all buckets of the aggregation and send a notification for each one whose doc_count is greater than 2?

And how do I set the throttle_period per error message?


(Alexander Reelsen) #2

Hey,

If you need to notify per error group/message, you will need a dedicated watch for each of those groups in order to have throttling up and running; there is no way around that.

If you do not know them in advance, it might make sense to have a generic catch-all watch (that alerts all the time) and then add new watches as you learn the messages over time.
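
Since the messages are not known in advance, the dedicated watches could be generated rather than written by hand. Below is a minimal Python sketch of that idea, not an official client: `build_error_watch` is a hypothetical helper, the index and field names (`logs`, `level`, `msg.keyword`, `timestamp`) are copied from the watch in the question, and it assumes `ctx.payload.hits.total` is a plain number, as in the payload shown in the question.

```python
import json


def build_error_watch(msg, threshold=2, throttle="1h", to="example@gmail.com"):
    """Return a watch body that alerts on one specific error message.

    Hypothetical helper: index and field names are taken from the watch
    in the question and may differ in your setup.
    """
    return {
        "trigger": {"schedule": {"interval": "1m"}},
        "input": {
            "search": {
                "request": {
                    "indices": ["logs"],
                    "body": {
                        "size": 0,  # only the hit count is needed
                        "query": {
                            "bool": {
                                "must": [
                                    {"match_phrase": {"level": "error"}},
                                    # restrict this watch to a single message
                                    {"term": {"msg.keyword": msg}},
                                    {"range": {"timestamp": {"gte": "now-1h", "lte": "now"}}},
                                ]
                            }
                        },
                    },
                }
            }
        },
        # fire only when the message occurred more than `threshold` times
        "condition": {"compare": {"ctx.payload.hits.total": {"gt": threshold}}},
        "actions": {
            "email_administrator": {
                # throttling now applies to this message only
                "throttle_period": throttle,
                "email": {
                    "to": to,
                    "subject": "Error " + msg + " occurred {{ctx.payload.hits.total}} times",
                    "body": "Too many '" + msg + "' errors in the last hour",
                },
            }
        },
    }


# Print the JSON body for one message seen in the logs.
print(json.dumps(build_error_watch("storage_failed"), indent=2))
```

The printed body would be sent with the same PUT request the question uses, under a separate watch ID per message, so each message gets its own throttle_period.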

Another idea might be to always alert on all messages but implement the throttling yourself in the script condition (because this is custom logic), so that you decide, for example based on the last 5 runs (which are stored in the watch history and can be queried using a chained input), whether something should be triggered or not.
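
The decision such custom logic has to make could look roughly like the following. This is a plain Python sketch of the logic only, not Painless and not a working script condition: `messages_to_alert` and the `last_alerted` mapping are hypothetical, and in a real watch the information about earlier alerts would have to be derived from the watch history via the chained input.

```python
from datetime import datetime, timedelta


def messages_to_alert(buckets, last_alerted, now, threshold=2,
                      quiet=timedelta(hours=1)):
    """Pick the error messages that should trigger a notification now.

    buckets      -- terms-aggregation buckets: [{"key": ..., "doc_count": ...}]
    last_alerted -- hypothetical dict mapping a message key to the datetime
                    of its last alert
    now          -- time of the current run
    """
    due = []
    for bucket in buckets:
        if bucket["doc_count"] <= threshold:
            continue  # not frequent enough to alert on
        previous = last_alerted.get(bucket["key"])
        if previous is not None and now - previous < quiet:
            continue  # alerted recently, still inside the quiet period
        due.append(bucket["key"])
    return due


now = datetime(2017, 7, 31, 12, 10)
buckets = [
    {"key": "connection_closed", "doc_count": 3},
    {"key": "storage_failed", "doc_count": 5},
    {"key": "error_occurred", "doc_count": 1},
]
# storage_failed was alerted 30 minutes ago, so it is skipped this run.
print(messages_to_alert(buckets, {"storage_failed": now - timedelta(minutes=30)}, now))
```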

Another current limitation is that you can only send out a single email per watch; you cannot send a dedicated email per log message type.
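
Given the single-email limit, one option may be to list every message over the threshold in that one email. Here is a minimal Python sketch of the formatting step, assuming bucket dictionaries shaped like the `error_msg.buckets` entries in the payload from the question; `summarize_buckets` is a hypothetical name, and the line format follows the notifications described in the question.

```python
def summarize_buckets(buckets, threshold=2):
    """Build one email body covering all buckets above the threshold.

    Returns None when no bucket qualifies, meaning no email is needed.
    """
    offenders = [b for b in buckets if b["doc_count"] > threshold]
    if not offenders:
        return None
    lines = ["error: {} {} times".format(b["key"], b["doc_count"])
             for b in offenders]
    return "\n".join(lines)


# Buckets from the payload in the question: neither exceeds 2,
# so no email body is produced.
payload_buckets = [
    {"doc_count": 2, "key": "ooooooooo"},
    {"doc_count": 2, "key": "ppppp"},
]
print(summarize_buckets(payload_buckets))  # prints None
```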

Hope that helps.

--Alex

