Unable to update/delete/execute watch

After several successful executions, my watch became "stuck": it could no longer be executed, updated, or deleted.

The error I get back contains the following:

"TimeoutException[could not delete watch [wirestory] within [10 seconds]... wait and try again. If this error continues to occur there is a high chance that the watch execution is stuck (either due to unresponsive external system such as an email service, or due to a bad script"

When I query .watches/_search, I see the following state.
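
(For reference, the query is roughly the following, with localhost:9200 being my local node:)

# search the .watches index, where Watcher stores the watch definitions
curl 'http://localhost:9200/.watches/_search?pretty'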

{
  "actions": {
    "log": {
      "ack": {
        "timestamp": "2015-10-01T22:03:41.662Z",
        "state": "awaits_successful_execution"
      }
    },

The Elasticsearch logs contain the following:

"failed to acquire lock on watch [wirestory] (waited for [10 seconds]). It is possible that for some reason this watch execution is stuck"

Has anyone seen this? I have not tried restarting Elasticsearch yet, because that would not be an acceptable option in production.

Hi Peter,

Seems like your watch indeed got stuck. To better understand where/why, you can execute the following request:

GET _watcher/stats/_all?emit_stacktraces

This should return the execution phase and the stacktrace of the stuck watch execution thread, which will give an insight into the reason for the watch getting stuck.
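
With curl, assuming the node is listening on localhost:9200 (adjust host/port as needed), that request looks something like:

# watcher stats for all watches, including stack traces of currently executing watch threads
curl 'http://localhost:9200/_watcher/stats/_all?human&pretty&emit_stacktraces'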

In some cases (e.g. a bad script) we can't do much to kill the watch execution (not at the moment, at least) and you'll be forced to shut down the node. But before you do that, it's important to delete the "bad" watch first, so that when you bring the node back up it will not be loaded and executed again.

To do that, you will need to force delete the watch before shutting down the node. You can do that with the following request:

DELETE _watcher/watch/my_watch?force
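
With curl, again assuming localhost:9200 and using "my_watch" as a placeholder for your watch id, that is roughly:

# force-delete the stuck watch so it is not loaded again when the node restarts
curl -XDELETE 'http://localhost:9200/_watcher/watch/my_watch?force'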

Thank you, here is the result of the stack trace. It appears to have gotten stuck reading the response from the webhook, but I am not sure how that is possible, given that the webhook action should be using the default timeout value of 10 seconds.

What is most odd here is that this watch works for a while, and then it eventually gets stuck.

"java.net.SocketInputStream.socketRead0(Native Method)",
"java.net.SocketInputStream.read(SocketInputStream.java:152)",
"java.net.SocketInputStream.read(SocketInputStream.java:122)",
"java.io.BufferedInputStream.fill(BufferedInputStream.java:235)",
"java.io.BufferedInputStream.read1(BufferedInputStream.java:275)",
"java.io.BufferedInputStream.read(BufferedInputStream.java:334)",
"sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:687)",
"sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:633)",
"sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1324)",
"java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:468)",
"org.elasticsearch.watcher.support.http.HttpClient.doExecute(HttpClient.java:171)",
"org.elasticsearch.watcher.support.http.HttpClient.execute(HttpClient.java:105)",
"org.elasticsearch.watcher.actions.webhook.ExecutableWebhookAction.execute(ExecutableWebhookAction.java:57)",
"org.elasticsearch.watcher.actions.ActionWrapper.execute(ActionWrapper.java:104)",
"org.elasticsearch.watcher.execution.ExecutionService.executeInner(ExecutionService.java:379)",
"org.elasticsearch.watcher.execution.ExecutionService.execute(ExecutionService.java:271)",
"org.elasticsearch.watcher.execution.ExecutionService$WatchExecutionTask.run(ExecutionService.java:417)",
"java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)",
"java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)",
"java.lang.Thread.run(Thread.java:745)"

OK, the problem is that the webhook action's read_timeout does not work. I can reproduce the issue by forcing the service handling the webhook call to wait. I would expect the webhook's read_timeout to abort the request, but it does not; it waits indefinitely.
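
In case anyone else wants to reproduce this without modifying a real service: a rough way to simulate a webhook endpoint that never responds (not exactly what I did, which was to make my own service sleep) is to point the watch's webhook host/port at a bare netcat listener on some arbitrary port, e.g. 8080:

# accept the connection but never send an HTTP response, so the webhook caller
# blocks while reading the response headers (matching the stack trace above)
nc -l 8080    # some netcat builds want: nc -l -p 8080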

This is strange... I tried to reproduce it, but the "read_timeout" seems to be working for me. Can you share your configuration?

Sure, I've tried with and without the timeouts configured.

"webhook": {
"scheme": "http",
"host": "redacted",
"port": 80,
"method": "post",
"connection_timeout": "3s",
"read_timeout": "5s",
"path": "/api/watchers/wirestory",
"params": {
},
"headers": {
"Content-Type": "application/json"
},
"body": "{{ctx.payload.results}}"
}

Looks good... How often does this watch execute? Did you look at the watch history? It would be interesting to run an "experiment" where you set the schedule to trigger every 10000s, and then use the _execute API to test the watch and its read timeout setting in isolation.
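
For example, assuming the watch id is still "wirestory" and the node is on localhost:9200, something along these lines (the 10000s interval is just a schedule long enough that the watch only runs when you ask it to):

# the watch's trigger section would be changed to roughly:
#   "trigger" : { "schedule" : { "interval" : "10000s" } }
# then kick off a single run manually via the execute watch API
curl -XPOST 'http://localhost:9200/_watcher/watch/wirestory/_execute?pretty'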

I had it execute once a minute initially, but once it gets hung it won't run anymore (confirmed by the watch history), and the _execute API also fails while it is hung.

I have worked around the problem by ensuring the service being called has a timeout configured, so that it kills the request if it takes too long, which unblocks the watch... not ideal, but it at least gives me a path forward.

OK... I'll try to dig further into it, run some more tests, and see if I can repro. Thanks for the feedback!

I am having a similar issue. My webhook handler was stuck while I was debugging it.
Now the watch is stuck and I cannot delete it. Any ideas on how to recover from this situation?
I get the error: TimeoutException[could not delete watch [mywatchname] within [10 seconds]... wait and try again. If this error continues to occur there is a high chance that the watch execution is stuck (either due to unresponsive external system such as an email service, or due to a bad script]

@prashant_sharma

Unfortunately, there is no task management API, so if the thread is legitimately stuck, then it's going to remain stuck. If it's the same webhook read timeout issue as above, then the timeout should be kicking in (but apparently isn't).

It would be interesting to see the hot_threads output for the elected master node (and therefore the node running the watches!) to try to see where it's stuck, like Peter Walsh did.

In terms of a workaround, since it is "just" a master node, you can hopefully restart the node without much concern, because another master node will take its place. This will kill the thread, and it should allow you to delete the watch while this happens. (Note: if your node is also a data node, then this complicates the suggestion, but it should still be a usable workaround.)
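
A rough sketch of that workaround (the watch id and the restart command are placeholders; adjust them for your setup):

# restart the elected master node to kill the stuck execution thread
# (use whatever mechanism manages your node; systemd is just an example)
sudo systemctl restart elasticsearch
# then delete the watch; the ?force parameter helps if a plain delete still times out
curl -XDELETE 'http://localhost:9200/_watcher/watch/mywatchname?force'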

Here is the hot threads dump.

::: [Power Princess][CIO9YQtlTguEHxavNFK6jw][psharma][inet[/XXX:9300]]
Hot threads at 2015-12-09T19:57:11.203Z, interval=500ms, busiestThreads=10000, ignoreIdleThreads=true:

0.0% (191.1micros out of 500ms) cpu usage by thread 'elasticsearch[Power Princess][transport_client_timer][T#1]{Hashed wheel timer #1}'
 10/10 snapshots sharing following 5 elements
   java.lang.Thread.sleep(Native Method)
   org.elasticsearch.common.netty.util.HashedWheelTimer$Worker.waitForNextTick(HashedWheelTimer.java:445)
   org.elasticsearch.common.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:364)
   org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
   java.lang.Thread.run(Thread.java:745)

0.0% (122.9micros out of 500ms) cpu usage by thread 'ticker-schedule-trigger-engine'
 10/10 snapshots sharing following 2 elements
   java.lang.Thread.sleep(Native Method)
   org.elasticsearch.watcher.trigger.schedule.engine.TickerScheduleTriggerEngine$Ticker.run(TickerScheduleTriggerEngine.java:160)

0.0% (0s out of 500ms) cpu usage by thread 'elasticsearch[keepAlive/1.5.2]'
 10/10 snapshots sharing following 8 elements
   sun.misc.Unsafe.park(Native Method)
   java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
   java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)

There is not much I can see in the hot threads output. Can you share the watcher stats output?
http://localhost:9200/_watcher/stats/_all?human&pretty&emit_stacktraces

I truncated the results, as they contain the same watch queued repeatedly. Looking at this, the queue size is 1000; is it the case that you cannot update the watch until the queue is flushed? Is this the problem?

{
  "watcher_state" : "started",
  "watch_count" : 1,
  "execution_thread_pool" : {
    "queue_size" : 1000,
    "max_size" : 10
  },
  "current_watches" : [ {
    "watch_id" : "emprecPolicy",
    "watch_record_id" : "emprecPolicy_3-2015-12-09T22:03:20.720Z",
    "triggered_time" : "2015-12-09T09:32:02.605Z",
    "execution_time" : "2015-12-09T22:03:20.720Z",
    "execution_phase" : "actions",
    "executed_actions" : [ ],
    "stack_trace" : [
      "java.net.SocketInputStream.socketRead0(Native Method)",
      "java.net.SocketInputStream.socketRead(SocketInputStream.java:116)",
      "java.net.SocketInputStream.read(SocketInputStream.java:170)",
      "java.net.SocketInputStream.read(SocketInputStream.java:141)",
      "java.io.BufferedInputStream.fill(BufferedInputStream.java:246)",
      "java.io.BufferedInputStream.read1(BufferedInputStream.java:286)",
      "java.io.BufferedInputStream.read(BufferedInputStream.java:345)",
      "sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:704)",
      "sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:647)",
      "sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1535)",
      "sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1440)",
      "java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)",
      "org.elasticsearch.watcher.support.http.HttpClient.doExecute(HttpClient.java:171)",
      "org.elasticsearch.watcher.support.http.HttpClient.execute(HttpClient.java:105)",
      "org.elasticsearch.watcher.actions.webhook.ExecutableWebhookAction.execute(ExecutableWebhookAction.java:57)",
      "org.elasticsearch.watcher.actions.ActionWrapper.execute(ActionWrapper.java:104)",
      "org.elasticsearch.watcher.execution.ExecutionService.executeInner(ExecutionService.java:379)",
      "org.elasticsearch.watcher.execution.ExecutionService.execute(ExecutionService.java:271)",
      "org.elasticsearch.watcher.execution.ExecutionService$WatchExecutionTask.run(ExecutionService.java:417)",
      "java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)",
      "java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)",
      "java.lang.Thread.run(Thread.java:745)"
    ]
  } ],
  "queued_watches" : [ {
    "watch_id" : "emprecPolicy",
    "watch_record_id" : "emprecPolicy_12-2015-12-09T22:03:20.722Z",
    "triggered_time" : "2015-12-09T09:36:32.677Z",
    "execution_time" : "2015-12-09T22:03:20.722Z"
  }, {
    "watch_id" : "emprecPolicy",
    "watch_record_id" : "emprecPolicy_13-2015-12-09T22:03:20.722Z",
    "triggered_time" : "2015-12-09T09:37:02.685Z",
    "execution_time" : "2015-12-09T22:03:20.722Z"
  } ]
}

This clearly shows that the watch execution with id 'emprecPolicy_3-2015-12-09T22:03:20.720Z' is busy reading from a socket opened by the webhook action.

Is watch execution 'emprecPolicy_3-2015-12-09T22:03:20.720Z' always stuck like this?
Or do other watch executions also appear here? Perhaps the read timeout isn't working correctly here.

As was discussed earlier here, a watch can always be deleted if the force query string parameter is used on the delete watch API. But if watch execution 'emprecPolicy_3-2015-12-09T22:03:20.720Z' is really stuck, then you also need to restart the node where Watcher is running.

Hi Martijn,
I believe the stuck watch state is persisted between restarts. I had to kill the ES process to get the instance restarted. A force delete did not work for me. I am trying to reproduce this scenario with ES 2.0 and Watcher 2.1.

Prashant