Better define Kibana Watchers for unavailable Pods

2021-04-14T22:00:00Z

Good morning,

I could use some assistance with a particular task. We implemented a Watcher that is triggered whenever a Pod is unavailable, and we will export the resulting alarm to Opsgenie and MS Teams to notify our team.
For now, it works reasonably well: whenever a pod is unavailable, an alarm is raised. The issue is that we would like the trigger to be better tailored to our needs. We need to know not only whether a pod is unavailable, but also how many pods are unavailable and the specific name / ID of each one.

We read the available documentation thoroughly and tried both editing the JSON directly (through the Watcher section) and using the "guided" alert procedure (Create alert, Inventory). Our current JSON file is included at the end of this post.
Both seem to be limited tools for this task; at least, we investigated for several days without finding a proper solution through the current API. The version we have installed is v7.12.0.

In the future, we would also need to tailor the trigger more intelligently. For instance, we would like to get rid of false positives (e.g. alarms that last less than 5 minutes) and, in general, to rely on more meaningful metrics so that the alarm fires only when it is really needed.
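One possible way to approach the false-positive requirement, as an untested sketch: group the samples by deployment and keep only the deployments whose minimum `kubernetes.deployment.replicas.unavailable` over the 5-minute window is above zero, so a deployment that recovered at any point in the window is dropped. The aggregation names (`deployments`, `min_unavailable`, `sustained_only`), the bucket size of 50, and the presence of a `kubernetes.deployment.name` keyword field in the Metricbeat indices are assumptions and have not been confirmed for this setup.

```json
{
  "aggs": {
    "deployments": {
      "terms": {
        "field": "kubernetes.deployment.name",
        "size": 50
      },
      "aggs": {
        "min_unavailable": {
          "min": {
            "field": "kubernetes.deployment.replicas.unavailable"
          }
        },
        "sustained_only": {
          "bucket_selector": {
            "buckets_path": {
              "minUnavailable": "min_unavailable"
            },
            "script": "params.minUnavailable > 0"
          }
        }
      }
    }
  }
}
```

A fragment like this would take the place of the `aggs` section in the watch's search body, and the condition would then check whether `ctx.payload.aggregations.deployments.buckets` is non-empty. It only looks at samples that exist in the window, so a deployment that stops reporting altogether would not be caught this way.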

A related discussion is available here: Enrich data to send to connectors on Watcher - Elastic Stack / Kibana - Discuss the Elastic Stack

If any other information is needed, feel free to ask.

Thank you,

G.

Current JSON file:

{
  "trigger": {
    "schedule": {
      "interval": "1m"
    }
  },
  "input": {
    "search": {
      "request": {
        "search_type": "query_then_fetch",
        "indices": [
          "metricbeat-*"
        ],
        "rest_total_hits_as_int": true,
        "body": {
          "size": 0,
          "query": {
            "bool": {
              "filter": {
                "range": {
                  "@timestamp": {
                    "gte": "{{ctx.trigger.scheduled_time}}||-5m",
                    "lte": "{{ctx.trigger.scheduled_time}}",
                    "format": "strict_date_optional_time||epoch_millis"
                  }
                }
              }
            }
          },
          "aggs": {
            "metricAgg": {
              "max": {
                "field": "kubernetes.deployment.replicas.unavailable"
              }
            }
          }
        }
      }
    }
  },
  "condition": {
    "script": {
      "source": "return ctx.payload.aggregations.metricAgg.value > params.threshold;",
      "lang": "painless",
      "params": {
        "threshold": 0
      }
    }
  },
  "actions": {
    "webhook_1": {
      "webhook": {
        "scheme": "https",
        "host": "api.eu.opsgenie.com",
        "port": 443,
        "method": "post",
        "path": "/v1/json/eswatcher",
        "params": {
          "apiKey": omitted
        },
        "headers": {
          "Content-Type": "application/json"
        },
        "body": "{{#toJson}}ctx{{/toJson}}"
      }
    }
  },
  "transform": {
    "script": {
      "source": "HashMap result = new HashMap(); result.result = ctx.payload.aggregations.metricAgg.value; return result;",
      "lang": "painless",
      "params": {
        "threshold": 0
      }
    }
  }
}
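
A possible direction for getting both the count and the names, sketched under assumptions and not tested against 7.12.0: instead of taking a single `max` over the deployment metric, filter on pod-level documents and bucket them by pod name with a `terms` aggregation. This assumes the Metricbeat `state_pod` metricset is enabled and that the fields `kubernetes.pod.status.ready` and `kubernetes.pod.name` are populated. "Not ready" is also only an approximation of "unavailable". The aggregation name `unready_pods` and the bucket size of 100 are illustrative. The trigger and actions sections are left out and would stay as in the watch above.

```json
{
  "input": {
    "search": {
      "request": {
        "indices": ["metricbeat-*"],
        "body": {
          "size": 0,
          "query": {
            "bool": {
              "filter": [
                {
                  "range": {
                    "@timestamp": {
                      "gte": "{{ctx.trigger.scheduled_time}}||-5m",
                      "lte": "{{ctx.trigger.scheduled_time}}",
                      "format": "strict_date_optional_time||epoch_millis"
                    }
                  }
                },
                {
                  "term": {
                    "kubernetes.pod.status.ready": "false"
                  }
                }
              ]
            }
          },
          "aggs": {
            "unready_pods": {
              "terms": {
                "field": "kubernetes.pod.name",
                "size": 100
              }
            }
          }
        }
      }
    }
  },
  "condition": {
    "script": {
      "source": "return ctx.payload.aggregations.unready_pods.buckets.size() > 0;",
      "lang": "painless"
    }
  },
  "transform": {
    "script": {
      "source": "def names = ctx.payload.aggregations.unready_pods.buckets.stream().map(b -> b.key).collect(Collectors.toList()); return ['count': names.size(), 'pods': names];",
      "lang": "painless"
    }
  }
}
```

With a transform along these lines, the webhook body `{{#toJson}}ctx{{/toJson}}` would carry `ctx.payload.count` and `ctx.payload.pods`, which the Opsgenie or MS Teams side could then render in the notification.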
