Watches Not Triggering

Elasticsearch & Kibana version: 6.2.4

Hi,

This morning we noticed that several (but not all) of our watches were not triggering and had no watcher history. In particular, one watch should have been firing off alerts but was totally silent. Using POST _xpack/watcher/watch/{watch_id}/_execute to manually trigger the alert caused it to fire.
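
For anyone reproducing this, a manual run can be issued with the Execute Watch API, e.g. (the watch ID below is a placeholder; an empty body executes the watch with its real condition and actions):

POST _xpack/watcher/watch/<watch_id>/_execute
{
  "record_execution": true
}

Setting record_execution persists a history entry for the manual run, which makes it easy to confirm that the .watcher-history index is being written to.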

Stopping and starting Watcher with POST _xpack/watcher/_stop and POST _xpack/watcher/_start caused all watches to resume triggering. Watcher history entries are being recorded again as well.
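
The exact calls, in case it helps anyone hitting the same thing (standard Watcher lifecycle APIs; the stats call is just to confirm each node reports "watcher_state": "started" afterwards):

POST _xpack/watcher/_stop
POST _xpack/watcher/_start
GET _xpack/watcher/stats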

This seems similar to the issue encountered here, and I am wondering whether this is a known bug in X-Pack. Watches are working as expected now, but without knowing what caused this I can't be sure it won't happen again.

Here are some details on the cluster:

Five nodes running Elasticsearch 6.2.4, one 6.2.4 Kibana instance, and one 6.2.4 Logstash instance. We upgraded from a three-node to a five-node cluster a few weeks ago; I am not sure whether that coincides with when the watches stopped.

Result of GET _xpack/watcher/stats (run after restarting Watcher, so I'm not sure how much help it will be):

{
  "_nodes": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "cluster_name": "clustername",
  "manually_stopped": false,
  "stats": [
    {
      "node_id": "L8C9e9FgSTulsHHs7vEqLQ",
      "watcher_state": "started",
      "watch_count": 12,
      "execution_thread_pool": {
        "queue_size": 0,
        "max_size": 20
      }
    },
    {
      "node_id": "fdKsVvaFQ9i6er6ck5eoVQ",
      "watcher_state": "started",
      "watch_count": 15,
      "execution_thread_pool": {
        "queue_size": 0,
        "max_size": 20
      }
    },
    {
      "node_id": "RskSddGqQ26Doz6y4mmTsg",
      "watcher_state": "started",
      "watch_count": 0,
      "execution_thread_pool": {
        "queue_size": 0,
        "max_size": 0
      }
    },
    {
      "node_id": "C2tmZvL8QbWZ1CRZCHGWPQ",
      "watcher_state": "started",
      "watch_count": 0,
      "execution_thread_pool": {
        "queue_size": 0,
        "max_size": 20
      }
    },
    {
      "node_id": "J043eFCoSV65Pd5vT9XERQ",
      "watcher_state": "started",
      "watch_count": 0,
      "execution_thread_pool": {
        "queue_size": 0,
        "max_size": 0
      }
    }
  ]
}
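
(I believe the same endpoint can also report currently executing and queued watches, which might have been more telling before the restart; something like:)

GET _xpack/watcher/stats?metric=_all&emit_stacktraces=true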

elasticsearch.yml:

cluster.name: clustername
node.name: ch-elas-epm01-p
path.data: /data/elasticsearch/data
network.host: 0.0.0.0
bootstrap.memory_lock: true
http.port: 9200
xpack.security.enabled: false
thread_pool.search.queue_size: 10000
discovery.zen.ping.unicast.hosts: ["ch-elas-epm01-p","ch-elas-epm02-p","ch-elas-epm03-p","ch-elas-epm06-p","ch-elas-epm07-p"]
discovery.zen.minimum_master_nodes: 3
path.logs: /data/elasticsearch/logs
action.destructive_requires_name: true
xpack.notification.email:
    account:
        smtp_acct:
            profile: standard
            smtp.host: smtpserverhost
            smtp.auth: false
            smtp.user: no-reply@companydomain.com
    default_account: smtp_acct

Logs:

Looking through the Elasticsearch logs, I found entries showing that watches failed to execute, along with exceptions:

...

[2018-09-19T18:40:52,198][WARN ][o.e.x.w.e.ExecutionService] [ch-elas-epm01-p] failed to execute watch [t_cfCSv2RYaRWtPmFD1lkg_elasticsearch_nodes]

[2018-09-19T19:01:37,673][ERROR][o.e.x.w.e.ExecutionService] [ch-elas-epm01-p] failed to update watch record [t_cfCSv2RYaRWtPmFD1lkg_elasticsearch_nodes_fd7af7c7-7e60-4aa6-ab70-a323e1b51838-2018-09-20T00:00:52.501Z]
org.elasticsearch.ElasticsearchTimeoutException: java.util.concurrent.TimeoutException: Timeout waiting for task.
    at org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:68) ~[elasticsearch-6.2.4.jar:6.2.4]
    at org.elasticsearch.xpack.watcher.history.HistoryStore.put(HistoryStore.java:97) ~[x-pack-watcher-6.2.4.jar:6.2.4]
    at org.elasticsearch.xpack.watcher.execution.ExecutionService.execute(ExecutionService.java:337) ~[x-pack-watcher-6.2.4.jar:6.2.4]
    at org.elasticsearch.xpack.watcher.execution.ExecutionService.lambda$executeAsync$6(ExecutionService.java:421) ~[x-pack-watcher-6.2.4.jar:6.2.4]
    at org.elasticsearch.xpack.watcher.execution.ExecutionService$WatchExecutionTask.run(ExecutionService.java:575) [x-pack-watcher-6.2.4.jar:6.2.4]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:573) [elasticsearch-6.2.4.jar:6.2.4]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
Caused by: java.util.concurrent.TimeoutException: Timeout waiting for task.
    at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:235) ~[elasticsearch-6.2.4.jar:6.2.4]
    at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:69) ~[elasticsearch-6.2.4.jar:6.2.4]
    at org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:66) ~[elasticsearch-6.2.4.jar:6.2.4]
    ... 8 more

[2018-09-19T19:01:37,702][ERROR][o.e.x.w.e.ExecutionService] [ch-elas-epm01-p] failed to update watch record [t_cfCSv2RYaRWtPmFD1lkg_xpack_license_expiration_2ddded28-df9e-4033-8588-95c4743b985a-2018-09-20T00:00:47.466Z]
org.elasticsearch.ElasticsearchTimeoutException: java.util.concurrent.TimeoutException: Timeout waiting for task.
    at org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:68) ~[elasticsearch-6.2.4.jar:6.2.4]
    at org.elasticsearch.xpack.watcher.history.HistoryStore.put(HistoryStore.java:97) ~[x-pack-watcher-6.2.4.jar:6.2.4]
    at org.elasticsearch.xpack.watcher.execution.ExecutionService.execute(ExecutionService.java:337) ~[x-pack-watcher-6.2.4.jar:6.2.4]
    at org.elasticsearch.xpack.watcher.execution.ExecutionService.lambda$executeAsync$6(ExecutionService.java:421) ~[x-pack-watcher-6.2.4.jar:6.2.4]
    at org.elasticsearch.xpack.watcher.execution.ExecutionService$WatchExecutionTask.run(ExecutionService.java:575) [x-pack-watcher-6.2.4.jar:6.2.4]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:573) [elasticsearch-6.2.4.jar:6.2.4]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
Caused by: java.util.concurrent.TimeoutException: Timeout waiting for task.
    at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:235) ~[elasticsearch-6.2.4.jar:6.2.4]
    at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:69) ~[elasticsearch-6.2.4.jar:6.2.4]
    at org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:66) ~[elasticsearch-6.2.4.jar:6.2.4]
    ... 8 more

[2018-09-19T19:01:37,721][ERROR][o.e.x.w.e.ExecutionService] [ch-elas-epm01-p] failed to update watch record [cdm-dead-high-depth_ecedbc9b-7c02-485b-a101-52985710d404-2018-09-20T00:00:37.461Z]
org.elasticsearch.ElasticsearchTimeoutException: java.util.concurrent.TimeoutException: Timeout waiting for task.
    at org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:68) ~[elasticsearch-6.2.4.jar:6.2.4]
    at org.elasticsearch.xpack.watcher.history.HistoryStore.put(HistoryStore.java:97) ~[x-pack-watcher-6.2.4.jar:6.2.4]
    at org.elasticsearch.xpack.watcher.execution.ExecutionService.execute(ExecutionService.java:337) ~[x-pack-watcher-6.2.4.jar:6.2.4]
    at org.elasticsearch.xpack.watcher.execution.ExecutionService.lambda$executeAsync$6(ExecutionService.java:421) ~[x-pack-watcher-6.2.4.jar:6.2.4]
    at org.elasticsearch.xpack.watcher.execution.ExecutionService$WatchExecutionTask.run(ExecutionService.java:575) [x-pack-watcher-6.2.4.jar:6.2.4]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:573) [elasticsearch-6.2.4.jar:6.2.4]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
Caused by: java.util.concurrent.TimeoutException: Timeout waiting for task.
    at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:235) ~[elasticsearch-6.2.4.jar:6.2.4]
    at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:69) ~[elasticsearch-6.2.4.jar:6.2.4]
    at org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:66) ~[elasticsearch-6.2.4.jar:6.2.4]
    ... 8 more

...

Hey,

There have been two bug fixes in that area, one in 6.3.0 and one in 6.4.1, so I'd highly recommend upgrading to 6.4.1. You can read about the bug at https://github.com/elastic/elasticsearch/pull/33167

Short version: the reload was not happening properly when .watches shards were being moved around in the cluster. A reload effectively did a full reinitialization, which would explain the behaviour you are seeing above.

Regarding your second post: those exceptions would not have stopped execution of the watch itself. However, the history entry does not appear to have been indexed, as the write did not complete within 30 seconds. Please keep monitoring and ping us if you see this again. I'd also be interested in the thread pool node stats and node info when this is happening.
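
For reference, something along these lines would give us what we need (standard nodes and cat APIs):

GET _nodes/stats/thread_pool
GET _nodes/info
GET _cat/thread_pool/watcher?v

The cat output shows active, queued, and rejected tasks for the watcher thread pool on each node.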

Thanks a lot!

--Alex
