When a Watcher HTTP action fails, it reports the failure forever without ever actually retrying

Hello, I have a Watcher action that posts to the Freshdesk API (ticket system). This works 99.9% of the time, but every now and then we receive a 502 back from Freshdesk.

If this happens, every consecutive alert skips the Freshdesk part and claims it is failing.

Even watches that should now report OK are reported as failing... it's quite annoying.

The only way I seem to be able to fix it is by recreating the watch. Am I doing something wrong here? The watcher config is as follows:

    "freshdesk_alert" : {
      "webhook" : {
        "method" : "POST",
        "host" : "X.freshdesk.com",
        "path" : "/api/v2/tickets",
        "scheme" : "https",
        "port" : 443,
        "body" : {
          "inline": {
            "subject": "X",
            "description": "X",
            "email": "devops@X.com",
            "priority": 3,
            "status": 2,
            "group_id" : 6000196835,
            "custom_fields" : {
              "customer": "Actual Experience",
              "it_or_devops_support_required": "Product Fault/Bug",
              "end_customer": "N/A",
              "cf_internal_logging_codes_1": "IT Incident",
              "cf_internal_logging_codes_2": "Network/Infrastructure",
              "cf_internal_logging_codes_3": "Event"
            }
          }
        },
        "auth" : {
          "basic" : {
            "username" : "{{PASSWORDADDEDBYCI}}",
            "password" : "X"
          }
        },
        "headers" : {
          "Content-Type" : "application/json"
        }
      }
    }, 

Can you share the output of the watcher history index for those failures in a gist/pastebin?

    GET .watcher-history-*/_search
    {
      "query": {
        "bool": {
          "filter": [
            {
              "term": {
                "watch_id": "YOUR_WATCH_ID"
              }
            }
          ]
        }
      },
      "sort": [
        {
          "trigger_event.triggered_time": {
            "order": "desc"
          }
        }
      ]
    }

This way we could take a closer look at why things keep failing.

Thanks!

Hey,

Original Failure:

https://pastebin.com/6zi9pkkQ

Consecutive Watches Listed As Error:

https://pastebin.com/KQ8hbtje

Watcher History (this is set to expire in 1 day):

https://pastebin.com/FXFKHnYK

Hey,

Thanks for providing those. From your first two snippets, I think I now know why this is happening.

At some point in time your condition was met, the HTTP request was sent and resulted in an error. This error is stored in the watch status at status.actions.freshdesk_alert.last_execution.successful - which is false, because the last time it got executed, it failed.

The next executions always exited early, because the condition was not met, which left the above status in place - this is why an error is displayed.

Putting the watch again resets the status and thus everything is looking good until this happens again.
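
To make the sequence above concrete, here is a toy model in Python. It is only a sketch and not Watcher's actual code: `run_watch` and its parameters are made up for illustration, and only the `last_execution.successful` field name comes from the watch status.

```python
def run_watch(status: dict, condition_met: bool, http_ok: bool) -> dict:
    """Toy model of a single watch execution (hypothetical, for illustration)."""
    if not condition_met:
        # The execution exits early: the action does not run, so the
        # previously stored last_execution entry is left untouched.
        return status
    # The action ran, so its outcome overwrites the stored status.
    return {"last_execution": {"successful": http_ok}}


status = {}
# Condition met, but Freshdesk answers with a 502.
status = run_watch(status, condition_met=True, http_ok=False)
# Condition no longer met: nothing runs, the old failure stays visible.
status = run_watch(status, condition_met=False, http_ok=True)
stale = status["last_execution"]["successful"]
# Condition met again and the request succeeds: the status is replaced.
status = run_watch(status, condition_met=True, http_ok=True)
recovered = status["last_execution"]["successful"]
```

In this model the failure stays on display for as long as the condition is not met, and it only clears once the action actually runs again and succeeds.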

The good thing is that this does not impact the execution of your watch; it only affects what is shown in the UI.

The bad thing is that this should probably either be fixed in the UI, or the state should be reset once the condition turns false.

Can you open an issue in the Elasticsearch repo with this data and the Kibana screenshot? That would be great, so the developers can take a look at what the right thing to do would be!

Thanks a bunch!

--Alex

I will do that now, thank you. Is there a retry mechanism that could be introduced?

Hey,

Thanks a bunch for the issue.

I think you may be wrong on the retrying part here. Please check all the consecutive watcher history outputs to see whether the condition is met again at some point but the state does not get updated. I suppose this is not the case, but I would like to be sure. The reason I ask is to make sure that the execution is not disturbed by this and that we share the same understanding.

--Alex

By retry I meant: if the watch alarms and the HTTP request fails with a 502, it would be cool to be able to add something like a retry count of 3, so that it tries the HTTP request again, and then again, before finally failing.

I don't currently have any watches in this failed state, so when it happens again, I'll make sure to trigger another alert to see if it then triggers the HTTP action. Ok?

Ah, I get it. Thanks for the explanation! Indeed, currently only a single request is ever sent within an action.
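
Until something like that exists, one option might be to do the retrying outside of the action, for example in a small relay that the webhook posts to. Here is a minimal sketch of the "retry 3" idea in Python. The names `post_with_retry` and `send` are hypothetical, and `send` stands in for whatever actually performs the POST to Freshdesk and returns the HTTP status code.

```python
import time
from typing import Callable


def post_with_retry(send: Callable[[], int], attempts: int = 3,
                    delay_seconds: float = 0.0) -> int:
    """Call send() up to `attempts` times, retrying only on HTTP 502."""
    status = send()
    for _ in range(attempts - 1):
        if status != 502:
            # Any other status (success or a different error) is final.
            break
        time.sleep(delay_seconds)
        status = send()
    return status


# Simulated responses: two 502s followed by a success.
responses = iter([502, 502, 201])
final_status = post_with_retry(lambda: next(responses), attempts=3)
```

Only a 502 is retried here, because that is the error described above. Whether other status codes should be retried as well is a separate decision.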

As requested: once the watch finally runs the HTTP action again, it works and clears the "error" state.
