Watchers Stopped Triggering

Elastic Cloud Version: 6.2.2

Hello,

A subset of my watches has stopped firing. There are no error messages reported by the watches themselves, the watch history for the non-triggering watches is empty, and the watches fire correctly if we simulate them. This also affects some of the built-in watches as well as the non-advanced watches we've made.
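
To clarify, by "simulate" I mean running the watch manually through the execute API, roughly like this, with <watch_id> standing in for one of the affected watches:

POST _xpack/watcher/watch/<watch_id>/_execute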

Opening and saving the affected watches again seems to resolve the issue, but I'm wondering if there is a known underlying cause that we can avoid in the future. The only lead I have is that we did a cluster update around the time the watches stopped, but I'm not sure from which version.


hey,

can you check whether Watcher is started by looking at the watcher stats, and paste the output here?
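
On 6.x that should be something like:

GET _xpack/watcher/stats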

--Alex

Here is the output:

{
  "_nodes": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "cluster_name": "0927d1797c8a646ce539795bf4d18698",
  "manually_stopped": false,
  "stats": [
    {
      "node_id": "pGiZWe_NQi2qIV1b3khC_Q",
      "watcher_state": "started",
      "watch_count": 51,
      "execution_thread_pool": {
        "queue_size": 0,
        "max_size": 10
      }
    }
  ]
}

I forgot to note in the original post that most of our watches are firing correctly; the affected ones all stopped triggering 2 months ago.

hey,

Can you stop and start Watcher, then check the watcher stats once again to see whether everything is started?
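
If it helps, on 6.x those calls should be roughly:

POST _xpack/watcher/_stop
POST _xpack/watcher/_start
GET _xpack/watcher/stats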

Alternatively, before doing that, could you pick one watch that currently does not get triggered, store it again, and see if it gets triggered again?
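
The re-store would be something along these lines, where <watch_id> is a placeholder for one of the non-triggering watches and the PUT body is the watch definition taken from the "watch" field of the GET response:

GET _xpack/watcher/watch/<watch_id>

PUT _xpack/watcher/watch/<watch_id>
{ ...watch definition from the "watch" field of the GET response... }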

I assume there is also nothing interesting in the log files? Has this been a multi-node cluster at some point in time? (I've never seen this before, so I'm super interested in more information.)

Thanks a ton for helping!

--Alex

{
  "_nodes": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "cluster_name": "0927d1797c8a646ce539795bf4d18698",
  "manually_stopped": false,
  "stats": [
    {
      "node_id": "pGiZWe_NQi2qIV1b3khC_Q",
      "watcher_state": "started",
      "watch_count": 66,
      "execution_thread_pool": {
        "queue_size": 0,
        "max_size": 10
      }
    }
  ]
}

Stopping and starting seems to have bumped up the watch count, and the previously inactive watches are starting to fire again.

I tried storing 2 watches again earlier, and that seems to have made them start running again. I'm pretty certain this has only been a single-node cluster since we launched it.

In the log files there are some watch execution failures for the built-in watches. Not sure how to format this nicely, sorry:

[2018-07-23T17:52:24,980][ERROR][org.elasticsearch.xpack.watcher.transform.script.ExecutableScriptTransform] failed to execute [script] transform for [DrkweI7bRXCjKJXMoXRdhQ_elasticsearch_cluster_status_6c0232bc-b1be-49d4-92f8-cc03f77a6d53-2018-07-23T17:52:24.973Z]
org.elasticsearch.script.ScriptException: runtime error
    at org.elasticsearch.painless.PainlessScript.convertToScriptException(PainlessScript.java:101) ~[?:?]
    at org.elasticsearch.painless.PainlessScript$Script.execute(ctx.vars.email_recipient = (ctx.payload.kibana_settings.hits.total > 0) ? ctx.payload.kibana_settings.hits.hits[0]._source.kibana_settings.xpack.default_admin_email : null;ctx.vars.is_new = ctx.vars.fails_check && !ctx.vars.not_resolved;ctx.vars.is_resolve ...:1070) ~[?:?]
    at org.elasticsearch.painless.ScriptImpl.run(ScriptImpl.java:105) ~[?:?]
    at org.elasticsearch.xpack.watcher.transform.script.ExecutableScriptTransform.doExecute(ExecutableScriptTransform.java:69) ~[x-pack-watcher-6.2.2.jar:6.2.2]
    at org.elasticsearch.xpack.watcher.transform.script.ExecutableScriptTransform.execute(ExecutableScriptTransform.java:53) ~[x-pack-watcher-6.2.2.jar:6.2.2]
    at org.elasticsearch.xpack.watcher.transform.script.ExecutableScriptTransform.execute(ExecutableScriptTransform.java:38) ~[x-pack-watcher-6.2.2.jar:6.2.2]
    at org.elasticsearch.xpack.watcher.execution.ExecutionService.executeInner(ExecutionService.java:481) ~[x-pack-watcher-6.2.2.jar:6.2.2]
    at org.elasticsearch.xpack.watcher.execution.ExecutionService.execute(ExecutionService.java:322) ~[x-pack-watcher-6.2.2.jar:6.2.2]
    at org.elasticsearch.xpack.watcher.execution.ExecutionService.lambda$executeAsync$7(ExecutionService.java:426) ~[x-pack-watcher-6.2.2.jar:6.2.2]
    at org.elasticsearch.xpack.watcher.execution.ExecutionService$WatchExecutionTask.run(ExecutionService.java:580) [x-pack-watcher-6.2.2.jar:6.2.2]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:573) [elasticsearch-6.2.2.jar:6.2.2]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_144]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_144]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_144]
Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
    at java.util.ArrayList.rangeCheck(ArrayList.java:653) ~[?:1.8.0_144]
    at java.util.ArrayList.get(ArrayList.java:429) ~[?:1.8.0_144]
    at org.elasticsearch.painless.PainlessScript$Script.execute(ctx.vars.email_recipient = (ctx.payload.kibana_settings.hits.total > 0) ? ctx.payload.kibana_settings.hits.hits[0]._source.kibana_settings.xpack.default_admin_email : null;ctx.vars.is_new = ctx.vars.fails_check && !ctx.vars.not_resolved;ctx.vars.is_resolve ...:347) ~[?:?]
    ... 12 more

Thanks for the help!

Hey,

Glad everything is working again. I'll take a closer look at the source over the next few days to see if anything stands out that could have caused this.

Do you happen to have a watch history entry for an execution of the watch with the id DrkweI7bRXCjKJXMoXRdhQ_elasticsearch_cluster_status around the logging timestamp?
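
If it is easier to search for it, the history entries live in the .watcher-history-* indices, so something roughly like this should find it (field names from memory, so treat this as a sketch):

GET .watcher-history-*/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "watch_id": "DrkweI7bRXCjKJXMoXRdhQ_elasticsearch_cluster_status" } },
        { "range": { "trigger_event.triggered_time": { "gte": "2018-07-23T17:45:00Z", "lte": "2018-07-23T18:00:00Z" } } }
      ]
    }
  },
  "sort": [ { "trigger_event.triggered_time": "desc" } ]
}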

--Alex

Here is a pastebin with the watcher result.

A co-worker noted last night that the watcher stoppage might have coincided with a configuration update we made about 2 months ago. Last night we made a similar configuration change on a monitoring cluster we have, and it seems to have resulted in a similar watcher stoppage. We fixed it with the same stop/start commands.

Can you provide some more information about what you did? I am curious what may have caused this, and whether we can prevent it in the future or at least come up with better error messages, so this doesn't happen to others.

I went into the Elastic Cloud console and changed the elasticsearch.yml to add a PagerDuty integration. Previously we only had a Slack integration configured.

The "grow and shrink" configuration change seems to have executed without errors. I didn't see anything weird in the logs while the configuration change was running.

This is what our configuration looks like now on both clusters:

xpack.notification.slack:
  account:
    monitoring:
      url: [URL]
xpack.notification.pagerduty:
  account:
    [account-name]:
      service_api_key: [api-key]
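
For reference, a minimal throwaway watch to exercise the new PagerDuty account would look something like this; the watch id is made up and the action syntax is from memory, so treat it as a sketch:

PUT _xpack/watcher/watch/pagerduty_account_smoke_test
{
  "trigger": { "schedule": { "interval": "24h" } },
  "input": { "simple": { "note": "pagerduty account check" } },
  "condition": { "always": {} },
  "actions": {
    "notify_pagerduty": {
      "pagerduty": {
        "account": "[account-name]",
        "description": "Watcher PagerDuty account smoke test"
      }
    }
  }
}

POST _xpack/watcher/watch/pagerduty_account_smoke_test/_execute

DELETE _xpack/watcher/watch/pagerduty_account_smoke_test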
