How to fix a stuck watch and another question

alerting

(Casie Owen) #1

Hi,

In an attempt to figure out why a watch wasn't executing, I tried to manually post it using this:

POST _watcher/watch/cluster_health_watch_red/_execute

I got:

"ElasticsearchTimeoutException[failed to execute http request. timeout expired]; nested: SocketTimeoutException[connect timed out]; "

I've fixed the issue and have a watch working for this, but I can't delete the one that I tried to manually post. When I try to delete it, I get:

TimeoutException[could not delete watch [cluster_health_watch_red] within [10 seconds]... wait and try again. If this error continues to occur there is a high chance that the watch execution is stuck (either due to unresponsive external system such as an email service, or due to a bad script]; ",
"status": 500

How do I delete that watch?

Also, for a search watch, does the "query" . . . "match" mean includes or starts with or =?. For example:

PUT /_watcher/watch/log_throttling
{
"trigger" : { "schedule" : { "interval" : "60s" } },
"input" : {
"search" : {
"request" : {
"indices" : [ "logs" ],
"body" : {
"query" : {
"match" : { "message": "now throttling indexing" }
}
}
}
}
},
"condition" : {
"compare" : { "ctx.payload.hits.total" : { "gt" : 0 }}
}
}

Would that catch this in a log:

[2016-07-31 23:28:57,346][INFO ][index.engine ] [datanode] [index_2016.07.31][4] now throttling indexing: numMergesInFlight=4, maxNumMerges=3

Thanks,
Casie


(Alexander Reelsen) #2

Heya,

two things here:

First, regarding deletion: The delete watch API supports a force HTTP parameter. You can set force=true parameter. Out of curiosity. Do you see any more verbose output in the logs?

Second, regarding your query: You are trying to query something, that is actually not indexed, but written to the Elasticsearch logfile, which is not indexed into Elasticsearch (unless you did that manually, which I will just assume here). If the message field contains that, you will find this document using your mentioned query. However, the default operator is OR, so you will find any documents that contains now or thottling or indexing, see the match query docs. The match query here searches for documents that contain the specified terms. Stop thinking in terms of starts with or ends with or LIKE '%foo%', and think more of the index of a book, which contains terms that link to pages.

Hope this helps!

--Alex


(Casie Owen) #3
  1. You are super helpful, thank you!

  2. using force=true worked. I see (now) that there is also a way to increase the timeout. I didn't try that.

  3. Now I am getting "unable to find watch for record [cluster_health_watch_red_6549-2016-08-01T16:52:31.665Z], perhaps it has been deleted, ignoring..." repeating in the log. What's the best way to fix that?

  4. What's the best way to index the elasticsearch logs from all nodes? Is it possible to index logs for nodes of specific types, rather than all? Is this stuff documented somewhere? We'd like to set up some log watches in our 15 node windows/azure production environment now that we've gotten a basic familiarity in our dev environment.

  5. Re: your curiosity, here is what I saw in the elasticsearch log when trying to delete without the force parameter: "[2016-08-01 16:39:00,080][WARN ][watcher.watch ] [masternode] failed to acquire lock on watch [cluster_health_watch_red] (waited for [10 seconds]). It is possible that for some reason this watch execution is stuck"

and this is what was happening when es was trying to execute the watch:
[2016-08-01 16:39:01,315][ERROR][watcher.input.http ] [masternode] failed to execute [http] input for [org.elasticsearch.watcher.watch.Watch@51ffaaa3]
org.elasticsearch.ElasticsearchTimeoutException: failed to execute http request. timeout expired
at org.elasticsearch.watcher.support.http.HttpClient.execute(HttpClient.java:107)
at org.elasticsearch.watcher.input.http.ExecutableHttpInput.doExecute(ExecutableHttpInput.java:64)
at org.elasticsearch.watcher.input.http.ExecutableHttpInput.execute(ExecutableHttpInput.java:56)
at org.elasticsearch.watcher.input.http.ExecutableHttpInput.execute(ExecutableHttpInput.java:40)
at org.elasticsearch.watcher.execution.ExecutionService.executeInner(ExecutionService.java:347)
at org.elasticsearch.watcher.execution.ExecutionService.execute(ExecutionService.java:271)
at org.elasticsearch.watcher.execution.ExecutionService$WatchExecutionTask.run(ExecutionService.java:417)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Caused by: java.net.SocketTimeoutException: connect timed out
at java.net.DualStackPlainSocketImpl.waitForConnect(Native Method)
at java.net.DualStackPlainSocketImpl.socketConnect(Unknown Source)
at java.net.AbstractPlainSocketImpl.doConnect(Unknown Source)
at java.net.AbstractPlainSocketImpl.connectToAddress(Unknown Source)
at java.net.AbstractPlainSocketImpl.connect(Unknown Source)
at java.net.PlainSocketImpl.connect(Unknown Source)
at java.net.SocksSocketImpl.connect(Unknown Source)
at java.net.Socket.connect(Unknown Source)
at sun.net.NetworkClient.doConnect(Unknown Source)
at sun.net.www.http.HttpClient.openServer(Unknown Source)
at sun.net.www.http.HttpClient.openServer(Unknown Source)
at sun.net.www.http.HttpClient.(Unknown Source)
at sun.net.www.http.HttpClient.New(Unknown Source)
at sun.net.www.http.HttpClient.New(Unknown Source)
at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(Unknown Source)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(Unknown Source)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(Unknown Source)
at sun.net.www.protocol.http.HttpURLConnection.connect(Unknown Source)
at org.elasticsearch.watcher.support.http.HttpClient.doExecute(HttpClient.java:169)
at org.elasticsearch.watcher.support.http.HttpClient.execute(HttpClient.java:105)
... 9 more

I believe the problem was that I had the host value configured wrongly.

Thanks again!

Casie


(Alexander Reelsen) #4

Hey,

  1. Thanks, happy to help! :slight_smile:

  2. The timeout only means, that the execution gets waited for until the master node has processed it, I dont assume you have master node issue here, so this will not have helped.

  3. This entry should only occur once in your logs. Or do you have different entries with different timestamps mentioned? Or is it the same entry over and over again? Wild assumption here (that also explains the log message in point 5). Your watch got stuck while executing. You can check out the _watcher/stats/_all?emit_stacktraces=true endpoint to see running watches and their stack traces. I assume your watch got stuck somehow. Restarting watcher using the _watcher/restart endpoint should help here.

  4. There are a couple of ways, most preferrably you could use filebeat or logstash to index those files into elasticsearch. I'd not use a log4j logger that uses the network to directly log into elasticsearch, as I have seen those being the cause of hanging elasticsearch clusters.

  5. The exception you pasted shows a request timeout as you likely specified a host that does not exist.

Hope this helps.

--Alex


(I Gaskin) #5

Thanks @spinscale for your detailed response. I actually encountered this same problem with an ironic circumstance. The watches that were stuck were meant to send email alerts on cluster health changes. I think they inadvertently were causing master disconnects, and timeouts everywhere.


(Alexander Reelsen) #6

Hey,

stuck watches should not cause master disconnects. Did you happen to find anything in the logs backing that statement?

One problem of older watcher version was, that it did not have any timeouts configured, so the watch could get stuck, if the email server was not up/delayed. We fixed this in 2.3.2

--Alex


(system) #7