Webhook watcher action timeout

gissavem · November 8, 2021, 10:02am

Hi!

I am trying to set up a Watcher to create incidents when our applications throw errors.

The watcher I've set up searches our logs for errors, and when it finds any it sends a webhook request to BetterUptime to create an incident there.

The problem is that the webhook action times out every now and then, and I cannot see why. My guess is that it has something to do with throttling, but that is only a guess.

This is how the watcher is set up:

{
  "trigger": {
    "schedule": {
      "interval": "600s"
    }
  },
  "input": {
    "chain": {
      "inputs": [
        {
          "first": {
            "search": {
              "request": {
                "search_type": "query_then_fetch",
                "indices": [
                  "logs-darkside-production*"
                ],
                "rest_total_hits_as_int": true,
                "body": {
                  "size": 1000,
                  "_source": {
                    "excludes": [
                      "fields.Body"
                    ]
                  },
                  "query": {
                    "bool": {
                      "filter": [
                        {
                          "range": {
                            "@timestamp": {
                              "gte": "now-900s"
                            }
                          }
                        },
                        {
                          "term": {
                            "level": "Error"
                          }
                        }
                      ],
                      "must": [
                        {
                          "match": {
                            "fields.Host": "articles.internalapis.svc.cluster.local"
                          }
                        }
                      ]
                    }
                  }
                }
              }
            }
          }
        },
        {
          "second": {
            "transform": {
              "script": {
                "source": "return [ 'from' : Instant.ofEpochMilli(ctx.execution_time.getMillis()).minus(15, ChronoUnit.MINUTES), 'to':  Instant.ofEpochMilli(ctx.execution_time.getMillis()), 'hits': ctx.payload.first.hits.total  ]",
                "lang": "painless"
              }
            }
          }
        }
      ]
    }
  },
  "condition": {
    "compare": {
      "ctx.payload.second.hits": {
        "gt": 0
      }
    }
  },
  "actions": {
    "create_betteruptime_incident": {
      "throttle_period_in_millis": 595000,
      "webhook": {
        "scheme": "https",
        "host": "betteruptime.com",
        "port": 443,
        "method": "post",
        "path": "/api/v2/incidents",
        "params": {},
        "headers": {
          "Authorization": "Bearer ***secret***",
          "Content-Type": "application/json"
        },
        "body": "...",
        "connection_timeout_in_millis": 45000,
        "read_timeout_millis": 45000
      }
    }
  }
}
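To sanity-check the throttling guess, here is a minimal sketch of time-based throttling in Python. It assumes an action is skipped only when its last successful run is more recent than the throttle period; `is_throttled` is a hypothetical helper for illustration, not Watcher's actual implementation. With the values from the watch above (interval 600s, throttle 595000 ms), consecutive runs should not be throttled under that assumption.

```python
from datetime import datetime, timedelta


def is_throttled(last_success, now, throttle_period):
    """Return True if the action would be skipped by time-based throttling.

    Assumption: throttling only compares the time since the last
    successful execution against the throttle period.
    """
    if last_success is None:
        # The action has never succeeded, so there is nothing to throttle against.
        return False
    return now - last_success < throttle_period


# Values taken from the watch definition above.
interval = timedelta(seconds=600)
throttle = timedelta(milliseconds=595000)

# Arbitrary example timestamp for the previous successful run.
t0 = datetime(2021, 11, 8, 10, 0, 0)

# The next scheduled run is 600s later, which is past the 595s throttle window.
print(is_throttled(t0, t0 + interval, throttle))
```

As far as I understand, a throttled action would also be recorded as throttled rather than as a failure with a socket timeout, so this model suggests the timeouts have a different cause.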

Now this works most of the time, but it seems like every other trigger of this action fails because of the aforementioned timeout.

Every execution marked as firing in this list went all the way through, i.e. we got incidents created via the webhook action.

Whenever an execution was reported as failed, these were the details in the watcher:

"actions": [
      {
        "id": "create_betteruptime_incident",
        "type": "webhook",
        "status": "failure",
        "error": {
          "root_cause": [
            {
              "type": "socket_timeout_exception",
              "reason": "Read timed out"
            }
          ],
          "type": "socket_timeout_exception",
          "reason": "Read timed out"
        }
      }
    ]
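One way to check whether the endpoint itself is sometimes slow, independent of Watcher, would be to time the same POST from another machine. The sketch below is only an illustration: `timed` and `post_incident` are hypothetical helper names, the token and body are placeholders you would supply yourself, and the 45 second timeout simply mirrors the value in the watch.

```python
import time
import urllib.request


def timed(send):
    """Run send() and return (elapsed_seconds, error_or_None)."""
    start = time.monotonic()
    try:
        send()
        return time.monotonic() - start, None
    except OSError as exc:
        # OSError covers socket timeouts and urllib's URLError/HTTPError.
        return time.monotonic() - start, exc


def post_incident(url, token, body, timeout_s=45):
    """POST a JSON body with a bearer token and return the HTTP status code."""
    req = urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return resp.status


# Example usage (not executed here; supply your own token and body):
# elapsed, err = timed(lambda: post_incident(
#     "https://betteruptime.com/api/v2/incidents", token, body))
# print(f"{elapsed:.1f}s", err)
```

If the elapsed time regularly approaches the timeout, the delay is probably on the receiving side or in the network path rather than in the watch configuration.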

Am I missing something when it comes to throttling or acknowledgment? Or does this have something to do with HTTP client pooling?

It feels like I am missing something, but I have not found a clear answer anywhere in the discussions here or in the documentation as to what I could be doing wrong.

Thank you for any input!
Regards

system · December 6, 2021, 10:02am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
