Watch is stuck at error, always needs a manual restart

Hi,

I'm using Elastic Stack 7.8 and have enabled a watch. Every night it reports an error: a connection timeout for the webhook action. Only when I edit the watch (without actually changing anything) and save it again in the morning does it start working normally.
This is the error:

 "actions": [
  {
    "id": "webhook_1",
    "type": "webhook",
    "status": "failure",
    "error": {
      "root_cause": [
        {
          "type": "connect_timeout_exception",
          "reason": "Connect to 1.2.3.4:8081 [/1.2.3.4] failed: connect timed out"
        }
      ],
      "type": "connect_timeout_exception",
      "reason": "Connect to 1.2.3.4:8081 [/1.2.3.4] failed: connect timed out",
      "caused_by": {
        "type": "socket_timeout_exception",
        "reason": "connect timed out"
      }
    }
  }
]
  },

Followed by all the other errors:

Why does the watch get stuck when it encounters an error? Can't it just move on and try again later? If it requires a restart, is there a restart command like the one in 6.8:
POST _xpack/watcher/_restart?

I'm not looking for a solution to the connection timeout. I just want the watch to keep running after a connection error and try again the next time it is triggered.

Thanks.

Hey,

when the watch is stuck and cannot be executed, can you run the Watcher Stats API with the emit_stacktraces parameter and share the output in a gist?
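
For reference, a sketch of what that request could look like in the Kibana Dev Tools console. The second form assumes the current_watches metric filter is available in your version; it should limit the output to watches that are executing at that moment, which is where a stuck watch would show up together with its stack trace.

```console
# All Watcher stats, including stack traces of running watch executions
GET _watcher/stats?emit_stacktraces=true

# Assumption: restrict the output to currently executing watches
GET _watcher/stats/current_watches?emit_stacktraces=true
```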

Thanks!

--Alex

Hi, Alex. Thanks for replying.

The watch works fine when I click the 'Send Request' button while creating a test watch and check its output. Only when it is left to execute on its own does it throw connect_timeout_exception.

This is the output of GET _watcher/stats?emit_stacktraces=true:

{
  "_nodes" : {
    "total" : 8,
    "successful" : 8,
    "failed" : 0
  },
  "cluster_name" : "abc",
  "manually_stopped" : false,
  "stats" : [
    {
      "node_id" : "lyELuRG0QAS64C93O-oqRg",
      "watcher_state" : "started",
      "watch_count" : 0,
      "execution_thread_pool" : {
        "queue_size" : 0,
        "max_size" : 1
      }
    },
    {
      "node_id" : "eEJyFdOwT3-MCKTGVx3p4w",
      "watcher_state" : "started",
      "watch_count" : 2,
      "execution_thread_pool" : {
        "queue_size" : 0,
        "max_size" : 40
      }
    },
    {
      "node_id" : "O3ezUXu2Q8qUq62scF11_g",
      "watcher_state" : "started",
      "watch_count" : 0,
      "execution_thread_pool" : {
        "queue_size" : 0,
        "max_size" : 0
      }
    },
    {
      "node_id" : "y6wGcIPQRHmhePoYtxI01Q",
      "watcher_state" : "started",
      "watch_count" : 0,
      "execution_thread_pool" : {
        "queue_size" : 0,
        "max_size" : 0
      }
    },
    {
      "node_id" : "57shI5qnS86BfgMSKaSsYg",
      "watcher_state" : "started",
      "watch_count" : 0,
      "execution_thread_pool" : {
        "queue_size" : 0,
        "max_size" : 40
      }
    },
    {
      "node_id" : "qR_AGOwJRNuBxsqPOWuRrA",
      "watcher_state" : "started",
      "watch_count" : 8,
      "execution_thread_pool" : {
        "queue_size" : 0,
        "max_size" : 40
      }
    },
    {
      "node_id" : "u9zQ0k-gS72zDbftee7OIQ",
      "watcher_state" : "started",
      "watch_count" : 0,
      "execution_thread_pool" : {
        "queue_size" : 0,
        "max_size" : 0
      }
    },
    {
      "node_id" : "wAP1CsuERLOQdECfH4y3ew",
      "watcher_state" : "started",
      "watch_count" : 0,
      "execution_thread_pool" : {
        "queue_size" : 0,
        "max_size" : 0
      }
    }
  ]
}

The watch runs on node qR_AGOwJRNuBxsqPOWuRrA, and when I checked the node stats I saw the following output:

"failures" : [
  {
    "type" : "failed_node_exception",
    "reason" : "Failed node [qR_AGOwJRNuBxsqPOWuRrA]",
    "node_id" : "qR_AGOwJRNuBxsqPOWuRrA",
    "caused_by" : {
      "type" : "translog_exception",
      "reason" : "Unable to get the earliest last modified time for the transaction log",
      "index_uuid" : "Gd5TSGq7Qoaut5hrpNocJw",
      "shard" : "0",
      "index" : ".kibana_1"
    }
  }
]

}

Could that be the reason? The .kibana_1 index has 2 shards, and 1 of them has missing translog files. How do I fix that?

Thanks.

I read @s1monw's response in "Cannot recover index because of missing tanslog files", but I didn't quite understand how renaming a file would help me. Also, my translog files are already named translog-11.tlog and translog.ckp.

Any help would be appreciated. Thanks!

Also, that index belongs to Kibana. Watcher does not store any information in it.

What we can infer from the response you pasted is that there were no stuck watches at the time you executed that request. Can you share the last run from the watcher history?

GET .watcher-history-*/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "watch_id": "my_watch"
          }
        }
      ]
    }
  },
  "sort": [
    {
      "trigger_event.triggered_time": {
        "order": "desc"
      }
    }
  ]
}

You need to put in your own watch ID, though. This way we can see what was executed and whether any errors occurred. I'm still wondering how it stopped executing.
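
If the full response is too large to share, a trimmed variant of the search above might be easier to paste. This is only a sketch: my_watch is still a placeholder, and the _source field names (state, result.actions) are assumptions based on what watch history documents usually contain, so adjust them if your documents look different.

```console
GET .watcher-history-*/_search
{
  "size": 1,
  "_source": [
    "watch_id",
    "state",
    "trigger_event.triggered_time",
    "result.actions"
  ],
  "query": {
    "bool": {
      "filter": [
        { "term": { "watch_id": "my_watch" } }
      ]
    }
  },
  "sort": [
    { "trigger_event.triggered_time": { "order": "desc" } }
  ]
}
```

Setting size to 1 together with the descending sort returns only the most recent execution.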

Also, did the log files on any of your nodes reveal anything about failed watch executions?
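
On the restart question: as far as I know there is no single restart endpoint in 7.x, but the Watcher service can be stopped and started again, and an individual watch can be deactivated and reactivated. A sketch, with my_watch again standing in for your watch ID. Whether either of these would actually clear the state you are seeing is an open question until we know why the watch stops executing.

```console
# Stop and start the Watcher service on the whole cluster
POST _watcher/_stop
POST _watcher/_start

# Deactivate and reactivate a single watch (my_watch is a placeholder)
PUT _watcher/watch/my_watch/_deactivate
PUT _watcher/watch/my_watch/_activate
```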