Watcher Troubleshooting

Greetings all,
I am looking for advice on how to troubleshoot watcher configurations. I have 2 different use cases for watcher, and so far I have run into a few challenges that I am not sure how to deal with. My trouble is not with the syntax of the watcher config. I can review the documentation and solve those issues myself, so please do not focus on the syntax in the following examples.

  1. As I iterate through watcher configs and try them in the Kibana UI or through the watcher execute API, I get very nondescript errors: "Cannot simulate watch - An Internal error occurred" or "{"statusCode":502,"error":"Bad Gateway","message":"Client request timeout"}". Neither of these is much help. Where should I look for more information about why I am getting this result?
  2. One watcher config that I created triggers either on its schedule or via the execute API and enters the firing state, but does not seem to progress. Any thoughts on how to troubleshoot this? (I am including a sanitized snippet from the watcher status page below.)
    {
  "watch_id": "WATCHID",
  "node": "WPv6z9KOSXaJIOQz80GfUg",
  "state": "executed",
  "user": "USERNAME",
  "status": {
    "state": {
      "active": true,
      "timestamp": "2020-09-04T18:15:08.840Z"
    },
    "last_checked": "2020-09-04T18:15:34.962Z",
    "last_met_condition": "2020-09-04T18:15:34.962Z",
    "actions": {
      "ACTIONID": {
        "ack": {
          "timestamp": "2020-09-04T18:15:34.962Z",
          "state": "ackable"
        },
        "last_execution": {
          "timestamp": "2020-09-04T18:15:34.962Z",
          "successful": true
        },
        "last_successful_execution": {
          "timestamp": "2020-09-04T18:15:34.962Z",
          "successful": true
        }
      }
    },
    "execution_state": "executed",
    "version": 1
  },
  "trigger_event": {
    "type": "manual",
    "triggered_time": "2020-09-04T18:15:34.962Z",
    "manual": {
      "schedule": {
        "scheduled_time": "2020-09-04T18:15:34.962Z"
      }
    }
  },
  "input": {
    "search": {
      "request": {
        "search_type": "query_then_fetch",
        "indices": [
          "INDEXPATTERN"
        ],
        "rest_total_hits_as_int": true,
        "body": {
          "size": 0,
          "query": {
            "bool": {
              "filter": {
                "range": {
                  "@timestamp": {
                    "gt": "now-9h",
                    "lt": "now"
                  }
                }
              }
            }
          }
        }
      }
    }
  },
  "condition": {
    "compare": {
      "ctx.payload.hits.total": {
        "gte": 1
      }
    }
  },
  "metadata": {
    "Dashboard_Link": "DASHBOARD URL"
  },
  "result": {
    "execution_time": "2020-09-04T18:15:34.962Z",
    "execution_duration": 300244,
    "input": {
      "type": "search",
      "status": "success",
      "payload": {
        "_shards": {
          "total": 2,
          "failed": 0,
          "successful": 2,
          "skipped": 0
        },
        "hits": {
          "hits": [],
          "total": 1,
          "max_score": null
        },
        "took": 1,
        "timed_out": false
      },
      "search": {
        "request": {
          "search_type": "query_then_fetch",
          "indices": [
            "INDEXPATTERN"
          ],
          "rest_total_hits_as_int": true,
          "body": {
            "size": 0,
            "query": {
              "bool": {
                "filter": {
                  "range": {
                    "@timestamp": {
                      "gt": "now-9h",
                      "lt": "now"
                    }
                  }
                }
              }
            }
          }
        }
      }
    },
    "condition": {
      "type": "compare",
      "status": "success",
      "met": true,
      "compare": {
        "resolved_values": {
          "ctx.payload.hits.total": 1
        }
      }
    },
    "transform": {
      "type": "script",
      "status": "success",
      "payload": {
        "result": 1
      }
    },
    "actions": [
      {
        "id": "ACTIONID",
        "type": "email",
        "status": "success",
        "email": {
          "account": "MAILSYSTEM",
          "message": {
            "id": "WATCHERID",
            "from": "TESTINGEMAIL@DOMAIN.COM",
            "sent_date": "2020-09-04T18:20:35.142866Z",
            "to": [
              "TESTINGEMAIL@DOMAIN.COM"
            ],
            "subject": "SUBJECT",
            "body": {
              "text": "BODY TEXT, Dashboard can be viewed here: LINK TO RELATED DASHBOARD"
            }
          }
        }
      }
    ]
  },
  "messages": []
}

For those who see this shortly after posting and before the weekend, have a good Labor Day weekend.
-Vince

Hey,

There is a longer blog post on debugging watches at https://www.elastic.co/blog/watching-the-watches-writing-debugging-and-testing-watches that might help you.

Regarding your two cases:

  1. Are you trying to create a report in that watch when this error message gets thrown?
  2. Which state do you expect the watch to be in, and why? The output looks as if an email has been sent. Was this the intended result or not?

Alexander,
I have already read that article and did not find a solution within it.

The goal of this watch is to look for at least X occurrences of a certain log entry and then generate a PDF of a specific dashboard to send via email. I have this configuration working for a different data source, so I know it is possible. I have received no email from this watcher config, despite the execution state showing OK or Firing. This is NOT the intended result, just to be clear.

The email action status is success, so the message was delivered successfully to the SMTP server. Can you use a debugging SMTP server like maildev to check whether the email is received? Do you have insight into that mail server, and can you debug the issue further there?
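To make that check concrete, here is a minimal Python sketch that pulls the per-action status and recipients out of a watch execution record like the one posted above. The function name `summarize_actions` is hypothetical, and the sample is a trimmed copy of the sanitized record from this thread, so treat it as an illustration rather than an official tool.

```python
def summarize_actions(record: dict) -> list[dict]:
    """Return one summary row per action in a watch execution record."""
    rows = []
    for action in record.get("result", {}).get("actions", []):
        # Only email actions carry an "email.message" section; others yield empty fields.
        message = action.get("email", {}).get("message", {})
        rows.append({
            "id": action.get("id"),
            "type": action.get("type"),
            "status": action.get("status"),
            "to": message.get("to", []),
            "sent_date": message.get("sent_date"),
        })
    return rows


# Trimmed copy of the sanitized record from this thread (placeholders kept as posted).
sample = {
    "watch_id": "WATCHID",
    "result": {
        "actions": [
            {
                "id": "ACTIONID",
                "type": "email",
                "status": "success",
                "email": {
                    "account": "MAILSYSTEM",
                    "message": {
                        "sent_date": "2020-09-04T18:20:35.142866Z",
                        "to": ["TESTINGEMAIL@DOMAIN.COM"],
                    },
                },
            }
        ]
    },
}

for row in summarize_actions(sample):
    print(row["id"], row["status"], row["to"])
```

A "success" status here only says the action ran and handed the message off. The "to" list is the part worth reading character by character.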

The result of this thread was that I had mistakenly typed the wrong domain name in the email address. Watcher was doing its job; this was operator error.
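For anyone who lands here with the same symptom, a small Python sketch of the check that would have caught this: compare each recipient's domain against the domains you expect to send to. The function name `unexpected_recipients` and the example.com addresses are hypothetical and not taken from the actual watch.

```python
def unexpected_recipients(recipients, expected_domains):
    """Return the addresses whose domain is not in the expected set."""
    expected = {d.lower() for d in expected_domains}
    flagged = []
    for address in recipients:
        # Everything after the last "@"; an address without "@" is flagged as-is.
        domain = address.rpartition("@")[2].lower()
        if domain not in expected:
            flagged.append(address)
    return flagged


# Hypothetical recipients: the second one has a transposed-letter typo in the domain.
print(unexpected_recipients(
    ["alerts@example.com", "alerts@exmaple.com"],
    ["example.com"],
))
# prints ['alerts@exmaple.com']
```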
