Webhook watcher action timeout

Hi!

I am trying to set up a Watcher to create incidents when our applications throw errors.

The watcher I've set up searches our logs for errors and, when it finds any, sends a webhook request to BetterUptime to create an incident there.

The problem I am facing is that the webhook action times out every now and then, and I cannot see why. I suspect it has something to do with throttling, but that is only a guess.

This is how the watcher is set up:

{
  "trigger": {
    "schedule": {
      "interval": "600s"
    }
  },
  "input": {
    "chain": {
      "inputs": [
        {
          "first": {
            "search": {
              "request": {
                "search_type": "query_then_fetch",
                "indices": [
                  "logs-darkside-production*"
                ],
                "rest_total_hits_as_int": true,
                "body": {
                  "size": 1000,
                  "_source": {
                    "excludes": [
                      "fields.Body"
                    ]
                  },
                  "query": {
                    "bool": {
                      "filter": [
                        {
                          "range": {
                            "@timestamp": {
                              "gte": "now-900s"
                            }
                          }
                        },
                        {
                          "term": {
                            "level": "Error"
                          }
                        }
                      ],
                      "must": [
                        {
                          "match": {
                            "fields.Host": "articles.internalapis.svc.cluster.local"
                          }
                        }
                      ]
                    }
                  }
                }
              }
            }
          }
        },
        {
          "second": {
            "transform": {
              "script": {
                "source": "return [ 'from' : Instant.ofEpochMilli(ctx.execution_time.getMillis()).minus(15, ChronoUnit.MINUTES), 'to':  Instant.ofEpochMilli(ctx.execution_time.getMillis()), 'hits': ctx.payload.first.hits.total  ]",
                "lang": "painless"
              }
            }
          }
        }
      ]
    }
  },
  "condition": {
    "compare": {
      "ctx.payload.second.hits": {
        "gt": 0
      }
    }
  },
  "actions": {
    "create_betteruptime_incident": {
      "throttle_period_in_millis": 595000,
      "webhook": {
        "scheme": "https",
        "host": "betteruptime.com",
        "port": 443,
        "method": "post",
        "path": "/api/v2/incidents",
        "params": {},
        "headers": {
          "Authorization": "Bearer ***secret***",
          "Content-Type": "application/json"
        },
        "body": "...",
        "connection_timeout_in_millis": 45000,
        "read_timeout_millis": 45000
      }
    }
  }
}
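To make the timing explicit, here is a minimal sketch of how I understand these settings to interact. The numbers are copied from the watch above; the function names are my own, and the model assumes that runs start exactly on schedule and that throttling is measured from the start of the previous run, which is a simplification.

```python
# Values copied from the watch definition above.
INTERVAL_S = 600                # trigger.schedule.interval ("600s")
WINDOW_S = 900                  # range filter: @timestamp >= now-900s
THROTTLE_S = 595_000 / 1000     # throttle_period_in_millis
READ_TIMEOUT_S = 45_000 / 1000  # read_timeout_millis


def window_overlap_s(interval_s: float, window_s: float) -> float:
    """Seconds of log data that two consecutive runs both search."""
    return max(0.0, window_s - interval_s)


def throttle_blocks_next_run(interval_s: float, throttle_s: float) -> bool:
    """True if the next scheduled run still falls inside the throttle period.

    Assumption: the throttle period is counted from the start of the
    previous run and runs are spaced exactly interval_s apart.
    """
    return throttle_s >= interval_s


if __name__ == "__main__":
    print("overlap between consecutive search windows (s):",
          window_overlap_s(INTERVAL_S, WINDOW_S))
    print("next run throttled:",
          throttle_blocks_next_run(INTERVAL_S, THROTTLE_S))
    print("read timeout (s):", READ_TIMEOUT_S)
```

Under those assumptions, consecutive runs share 300 seconds of log data, so a single error can satisfy the condition in two runs in a row, and the 595-second throttle period ends just before the next 600-second run, so it should not suppress it.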

This works most of the time, but it seems like every other trigger of this action fails because of the aforementioned timeout.

Every execution marked as firing in the execution list went all the way through, i.e. an incident was created via the webhook action.

Every time an execution failed, the watcher reported these details:

"actions": [
  {
    "id": "create_betteruptime_incident",
    "type": "webhook",
    "status": "failure",
    "error": {
      "root_cause": [
        {
          "type": "socket_timeout_exception",
          "reason": "Read timed out"
        }
      ],
      "type": "socket_timeout_exception",
      "reason": "Read timed out"
    }
  }
]

Am I missing something when it comes to throttling or acknowledgment? Or does this have something to do with HTTP client pooling?

I have not found a clear answer, either in the discussions here or in the documentation, as to what I could be doing wrong.

Thank you for any input!
Regards
