Watcher fails or gets stuck posting to Slack

alerting

(Bwgriffith) #1

I have a watch that posts to Slack as an action. When it is triggered, it occasionally works (for a few days, then it randomly stops), but most of the time it fails. Any ideas?

Below are the error and the stack trace:

"id": "notify-slack",
"type": "slack",
"status": "failure",
"slack": {
   "account": "monitoring",
   "sent_messages": [
      {
         "status": "failure",
         "reason": "SSLHandshakeException[sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target]; nested: ValidatorException[PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target]; nested: SunCertPathBuilderException[unable to find valid certification path to requested target];

Finally, today it got completely stuck in execution. The stack trace is below:

{
   "watcher_state":"started",
   "watch_count":6,
   "execution_thread_pool":{
      "queue_size":0,
      "max_size":160
   },
   "current_watches":[
      {
         "watch_id":"esb_fault",
         "watch_record_id":"esb_fault_1262-2016-03-31T04:23:50.578Z",
         "triggered_time":"2016-03-31T04:23:50.578Z",
         "execution_time":"2016-03-31T04:23:50.578Z",
         "execution_phase":"actions",
         "stack_trace":[
            "java.net.SocketInputStream.socketRead0(Native Method)",
            "java.net.SocketInputStream.socketRead(SocketInputStream.java:116)",
            "java.net.SocketInputStream.read(SocketInputStream.java:170)",
            "java.net.SocketInputStream.read(SocketInputStream.java:141)",
            "sun.security.ssl.InputRecord.readFully(InputRecord.java:465)",
            "sun.security.ssl.InputRecord.read(InputRecord.java:503)",
            "sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973)",
            "sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)",
            "sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)",
            "sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)",
            "sun.net.www.protocol.https.HttpsClient.afterConnect(HttpsClient.java:559)",
            "sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:185)",
            "sun.net.www.protocol.http.HttpURLConnection.getOutputStream0(HttpURLConnection.java:1283)",
            "sun.net.www.protocol.http.HttpURLConnection.getOutputStream(HttpURLConnection.java:1258)",
            "sun.net.www.protocol.https.HttpsURLConnectionImpl.getOutputStream(HttpsURLConnectionImpl.java:250)",
            "org.elasticsearch.watcher.support.http.HttpClient.doExecute(HttpClient.java:195)",
            "org.elasticsearch.watcher.support.http.HttpClient.execute(HttpClient.java:126)",
            "org.elasticsearch.watcher.actions.slack.service.SlackAccount.send(SlackAccount.java:121)",
            "org.elasticsearch.watcher.actions.slack.service.SlackAccount.send(SlackAccount.java:77)",
            "org.elasticsearch.watcher.actions.slack.ExecutableSlackAction.execute(ExecutableSlackAction.java:67)",
            "org.elasticsearch.watcher.actions.ActionWrapper.execute(ActionWrapper.java:106)",
            "org.elasticsearch.watcher.execution.ExecutionService.executeInner(ExecutionService.java:388)",
            "org.elasticsearch.watcher.execution.ExecutionService.execute(ExecutionService.java:273)",
            "org.elasticsearch.watcher.execution.ExecutionService$WatchExecutionTask.run(ExecutionService.java:438)",
            "java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)",
            "java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)",
            "java.lang.Thread.run(Thread.java:745)"
         ]
      }
   ],
   "queued_watches":[

   ],
   "manually_stopped":false
}
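For anyone who wants to spot this condition automatically, here is a minimal Python sketch that scans stats output shaped like the JSON above and reports executions that have been running longer than a threshold. The function names (`parse_ts`, `stuck_watches`), the five-minute threshold, and the "now" value are illustrative assumptions and not part of Watcher; only the field names are taken from the output above.

```python
from datetime import datetime, timedelta, timezone


def parse_ts(value):
    """Parse timestamps of the form 2016-03-31T04:23:50.578Z as UTC."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


def stuck_watches(stats, now, max_age=timedelta(minutes=5)):
    """Return (watch_id, phase, age) for executions older than max_age.

    max_age is an arbitrary threshold chosen for illustration.
    """
    stuck = []
    for watch in stats.get("current_watches", []):
        age = now - parse_ts(watch["execution_time"])
        if age > max_age:
            stuck.append((watch["watch_id"], watch["execution_phase"], age))
    return stuck


# Trimmed copy of the stats output above (stack trace omitted).
stats = {
    "watcher_state": "started",
    "current_watches": [
        {
            "watch_id": "esb_fault",
            "execution_time": "2016-03-31T04:23:50.578Z",
            "execution_phase": "actions",
        }
    ],
}

# Hypothetical "now": one hour after the execution started.
now = datetime(2016, 3, 31, 5, 23, 50, 578000, tzinfo=timezone.utc)
for watch_id, phase, age in stuck_watches(stats, now):
    print(f"{watch_id} has been in phase '{phase}' for {age}")
```

A watch that stays in the "actions" phase far longer than the action should take, as esb_fault does here, is the signal to look for.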

(Alexander Reelsen) #2

Hey,

Are there any stack traces in your master node log file from around that time?

Did you change any of the HTTP connection or read timeouts that can be configured as global defaults? The settings I am talking about are watcher.http.default_connection_timeout and watcher.http.default_read_timeout.
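To illustrate why the read timeout matters here: the stack trace in the first post shows the thread blocked in a socket read while waiting for the remote side during the TLS handshake. The sketch below is plain Python, not Watcher's code, and only demonstrates the general behavior. It starts a local server that accepts a connection and never replies, then shows that a read with a timeout gives up instead of blocking. The helper names and the timeout values are made up for the example.

```python
import socket
import threading


def silent_server():
    """Listen on a free local port; accept one connection but never reply."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    accepted = []  # hold the accepted socket so it stays open

    def accept_one():
        conn, _ = server.accept()
        accepted.append(conn)

    threading.Thread(target=accept_one, daemon=True).start()
    return server, accepted


def read_with_timeout(port, connect_timeout, read_timeout):
    """Return "timed out" if nothing arrives within read_timeout seconds."""
    client = socket.create_connection(("127.0.0.1", port), timeout=connect_timeout)
    try:
        client.settimeout(read_timeout)
        client.recv(1024)  # blocks until data arrives or the timeout expires
        return "received data"
    except socket.timeout:
        return "timed out"
    finally:
        client.close()


server, _accepted = silent_server()
port = server.getsockname()[1]
# Illustrative values only; these are not Watcher's defaults.
outcome = read_with_timeout(port, connect_timeout=1.0, read_timeout=0.5)
print(outcome)
server.close()
```

Without the `settimeout` call, the `recv` would wait for as long as the silent peer keeps the connection open, which is the same kind of hang the stack trace shows.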

You can also try restarting Watcher using the /_watcher/restart endpoint to clear out the stuck watches.

--Alex

