Watchers Stopped Triggering

Elastic Cloud Version: 6.2.2

Hello,

A subset of my watches has stopped firing. There are no error messages reported by the watches themselves, the watch history for the non-triggering watches is empty, and the watches fire correctly if we simulate them. This also affects some of the built-in watches as well as the non-advanced watches we've made.
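
To clarify, by "simulate" I mean running the watch manually through the execute API, roughly like this, with <watch_id> standing in for one of the affected watches:

POST _xpack/watcher/watch/<watch_id>/_execute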

Opening and saving the affected watches again seems to resolve the issue, but I'm wondering if there is a known underlying cause that we can avoid in the future. The only lead I have is that we did a cluster update around the time the watches stopped, but I'm not sure from which version.


hey,

can you check whether Watcher is started by looking at the watcher stats, and paste the output here?
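
On 6.x that should be something like:

GET _xpack/watcher/stats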

--Alex

Here is the output:

{
  "_nodes": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "cluster_name": "0927d1797c8a646ce539795bf4d18698",
  "manually_stopped": false,
  "stats": [
    {
      "node_id": "pGiZWe_NQi2qIV1b3khC_Q",
      "watcher_state": "started",
      "watch_count": 51,
      "execution_thread_pool": {
        "queue_size": 0,
        "max_size": 10
      }
    }
  ]
}

I forgot to note in the original post that most of our watches are firing correctly; the affected ones all stopped triggering 2 months ago.

hey,

Can you stop and start Watcher, then check the watcher stats once again to see whether everything is started?
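
If it helps, on 6.x those calls should be roughly:

POST _xpack/watcher/_stop
POST _xpack/watcher/_start
GET _xpack/watcher/stats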

Alternatively, before doing that, could you pick one watch that currently does not get triggered, store it again, and see if it gets triggered again?
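
The re-store would be something along these lines, where <watch_id> is a placeholder for one of the non-triggering watches and the PUT body is the watch definition taken from the "watch" field of the GET response:

GET _xpack/watcher/watch/<watch_id>

PUT _xpack/watcher/watch/<watch_id>
{ ...watch definition from the "watch" field of the GET response... }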

I assume there is also nothing interesting in the log files? Has this been a multi-node cluster at some point in time? (I've never seen this before, so I'm super interested in more information.)

Thanks a ton for helping!

--Alex

{
  "_nodes": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "cluster_name": "0927d1797c8a646ce539795bf4d18698",
  "manually_stopped": false,
  "stats": [
    {
      "node_id": "pGiZWe_NQi2qIV1b3khC_Q",
      "watcher_state": "started",
      "watch_count": 66,
      "execution_thread_pool": {
        "queue_size": 0,
        "max_size": 10
      }
    }
  ]
}

Stopping and starting seems to have bumped up the watch count, and the previously inactive watches are starting to fire again.

I tried storing 2 watches again earlier, and that seems to have made them start running again. I'm pretty certain this has only been a single-node cluster since we launched it.

In the log files there are some watch execution failures for the built-in watches. Not sure how to format this nicely, sorry:

[2018-07-23T17:52:24,980][ERROR][org.elasticsearch.xpack.watcher.transform.script.ExecutableScriptTransform] failed to execute [script] transform for [DrkweI7bRXCjKJXMoXRdhQ_elasticsearch_cluster_status_6c0232bc-b1be-49d4-92f8-cc03f77a6d53-2018-07-23T17:52:24.973Z]
org.elasticsearch.script.ScriptException: runtime error
    at org.elasticsearch.painless.PainlessScript.convertToScriptException(PainlessScript.java:101) ~[?:?]
    at org.elasticsearch.painless.PainlessScript$Script.execute(ctx.vars.email_recipient = (ctx.payload.kibana_settings.hits.total > 0) ? ctx.payload.kibana_settings.hits.hits[0]._source.kibana_settings.xpack.default_admin_email : null;ctx.vars.is_new = ctx.vars.fails_check && !ctx.vars.not_resolved;ctx.vars.is_resolve ...:1070) ~[?:?]
    at org.elasticsearch.painless.ScriptImpl.run(ScriptImpl.java:105) ~[?:?]
    at org.elasticsearch.xpack.watcher.transform.script.ExecutableScriptTransform.doExecute(ExecutableScriptTransform.java:69) ~[x-pack-watcher-6.2.2.jar:6.2.2]
    at org.elasticsearch.xpack.watcher.transform.script.ExecutableScriptTransform.execute(ExecutableScriptTransform.java:53) ~[x-pack-watcher-6.2.2.jar:6.2.2]
    at org.elasticsearch.xpack.watcher.transform.script.ExecutableScriptTransform.execute(ExecutableScriptTransform.java:38) ~[x-pack-watcher-6.2.2.jar:6.2.2]
    at org.elasticsearch.xpack.watcher.execution.ExecutionService.executeInner(ExecutionService.java:481) ~[x-pack-watcher-6.2.2.jar:6.2.2]
    at org.elasticsearch.xpack.watcher.execution.ExecutionService.execute(ExecutionService.java:322) ~[x-pack-watcher-6.2.2.jar:6.2.2]
    at org.elasticsearch.xpack.watcher.execution.ExecutionService.lambda$executeAsync$7(ExecutionService.java:426) ~[x-pack-watcher-6.2.2.jar:6.2.2]
    at org.elasticsearch.xpack.watcher.execution.ExecutionService$WatchExecutionTask.run(ExecutionService.java:580) [x-pack-watcher-6.2.2.jar:6.2.2]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:573) [elasticsearch-6.2.2.jar:6.2.2]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_144]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_144]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_144]
Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
    at java.util.ArrayList.rangeCheck(ArrayList.java:653) ~[?:1.8.0_144]
    at java.util.ArrayList.get(ArrayList.java:429) ~[?:1.8.0_144]
    at org.elasticsearch.painless.PainlessScript$Script.execute(ctx.vars.email_recipient = (ctx.payload.kibana_settings.hits.total > 0) ? ctx.payload.kibana_settings.hits.hits[0]._source.kibana_settings.xpack.default_admin_email : null;ctx.vars.is_new = ctx.vars.fails_check && !ctx.vars.not_resolved;ctx.vars.is_resolve ...:347) ~[?:?]
    ... 12 more

Thanks for the help!

Hey,

Glad everything is working again. I'll take a closer look at the source over the next few days to see if anything stands out that could have caused this.

Do you happen to have a watch history entry for an execution of the watch with the id DrkweI7bRXCjKJXMoXRdhQ_elasticsearch_cluster_status around the logging timestamp?
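
If it is easier to search for it, the history entries live in the .watcher-history-* indices, so something roughly like this should find it (field names from memory, so treat this as a sketch):

GET .watcher-history-*/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "watch_id": "DrkweI7bRXCjKJXMoXRdhQ_elasticsearch_cluster_status" } },
        { "range": { "trigger_event.triggered_time": { "gte": "2018-07-23T17:45:00Z", "lte": "2018-07-23T18:00:00Z" } } }
      ]
    }
  },
  "sort": [ { "trigger_event.triggered_time": "desc" } ]
}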

--Alex

Here is a pastebin with the watcher result.

A co-worker noted last night that the watcher stoppage might have coincided with a configuration update we made about 2 months ago. Last night we made a similar configuration change on a monitoring cluster we have, and it seems to have resulted in a similar watcher stoppage. We fixed it with the same stop/start commands.

Can you provide some more information about what you did? I am curious what may have caused this, and whether we can prevent it in the future or at least come up with better error messages, so this doesn't happen to others.

I went into the Elastic Cloud console and changed the elasticsearch.yml to add a PagerDuty integration. Previously we only had a Slack integration configured.

The "grow and shrink" configuration change seems to have executed without errors. I didn't see anything weird in the logs while the configuration change was running.

This is what our configuration looks like now on both clusters:

xpack.notification.slack:
  account:
    monitoring:
      url: [URL]
xpack.notification.pagerduty:
  account:
    [account-name]:
      service_api_key: [api-key]
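
For reference, a minimal throwaway watch to exercise the new PagerDuty account would look something like this; the watch id is made up and the action syntax is from memory, so treat it as a sketch:

PUT _xpack/watcher/watch/pagerduty_account_smoke_test
{
  "trigger": { "schedule": { "interval": "24h" } },
  "input": { "simple": { "note": "pagerduty account check" } },
  "condition": { "always": {} },
  "actions": {
    "notify_pagerduty": {
      "pagerduty": {
        "account": "[account-name]",
        "description": "Watcher PagerDuty account smoke test"
      }
    }
  }
}

POST _xpack/watcher/watch/pagerduty_account_smoke_test/_execute

DELETE _xpack/watcher/watch/pagerduty_account_smoke_test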
