Watch is stuck at error, always needs a manual restart

Shubhangi · September 24, 2020, 8:59am

Hi,

I'm using Elastic Stack 7.8 and have enabled a watch. Every night it shows an error: connection timeout for the webhook action, and the action status is failure. Only when I edit the watch (without actually changing anything) and save it again in the morning does it start working normally.
This is the error:

 "actions": [
  {
    "id": "webhook_1",
    "type": "webhook",
    "status": "failure",
    "error": {
      "root_cause": [
        {
          "type": "connect_timeout_exception",
          "reason": "Connect to 1.2.3.4:8081 [/1.2.3.4] failed: connect timed out"
        }
      ],
      "type": "connect_timeout_exception",
      "reason": "Connect to 1.2.3.4:8081 [/1.2.3.4] failed: connect timed out",
      "caused_by": {
        "type": "socket_timeout_exception",
        "reason": "connect timed out"
      }
    }
  }
]

This is followed by all the other errors.

Why does the watch get stuck when it encounters an error? Can't it just move on and try again later? If it requires a restart, is there a restart command like the one in 6.8?
POST _xpack/watcher/_restart

I'm not looking for a solution to the connection timeout. I just want the watch to recover after a connection error and try again the next time it is triggered.

Thanks.
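
For reference, a minimal Python sketch of what a "restart" could look like on 7.x. As far as I know, the _restart endpoint is no longer available there, and the closest equivalents are POST _watcher/_stop followed by POST _watcher/_start for the whole service, or the deactivate and activate endpoints for a single watch. The function names, the base URL, and the absence of authentication are assumptions made for illustration, and nothing in this thread confirms that either sequence clears the state described above.

```python
import json
import urllib.parse
import urllib.request


def watcher_restart_requests(watch_id=None):
    """Return the (method, path) pairs for a stop/start style restart.

    Without a watch ID the pairs target the whole Watcher service.
    With a watch ID they deactivate and re-activate that single watch.
    """
    if watch_id is None:
        return [("POST", "/_watcher/_stop"), ("POST", "/_watcher/_start")]
    quoted = urllib.parse.quote(watch_id, safe="")
    return [
        ("PUT", f"/_watcher/watch/{quoted}/_deactivate"),
        ("PUT", f"/_watcher/watch/{quoted}/_activate"),
    ]


def send(base_url, method, path, timeout=10):
    """Send one request and return the decoded JSON body.

    base_url is hypothetical (for example http://localhost:9200);
    a secured cluster would also need credentials, which are omitted here.
    """
    request = urllib.request.Request(base_url.rstrip("/") + path, method=method)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Print the requests instead of sending them, so this is safe to run.
    for method, path in watcher_restart_requests("my_watch"):
        print(method, path)
```

Stopping the service is asynchronous as far as I understand, so it may be worth checking the watcher_state in the stats output before starting it again.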

spinscale · September 24, 2020, 9:46am

Hey,

when the watch is stuck and cannot be executed, can you run the Watcher Stats API with the emit_stacktraces parameter and share the output in a gist?

Thanks!

--Alex

Shubhangi · September 26, 2020, 6:30pm

Hi, Alex. Thanks for replying.

This watch works well when I click the 'Send Request' button while creating a test watch and check its output. Only when it is left to execute on its own does it throw connect_timeout_exception.

This is the output of GET _watcher/stats?emit_stacktraces=true:

{
  "_nodes" : {
    "total" : 8,
    "successful" : 8,
    "failed" : 0
  },
  "cluster_name" : "abc",
  "manually_stopped" : false,
  "stats" : [
    {
      "node_id" : "lyELuRG0QAS64C93O-oqRg",
      "watcher_state" : "started",
      "watch_count" : 0,
      "execution_thread_pool" : {
        "queue_size" : 0,
        "max_size" : 1
      }
    },
    {
      "node_id" : "eEJyFdOwT3-MCKTGVx3p4w",
      "watcher_state" : "started",
      "watch_count" : 2,
      "execution_thread_pool" : {
        "queue_size" : 0,
        "max_size" : 40
      }
    },
    {
      "node_id" : "O3ezUXu2Q8qUq62scF11_g",
      "watcher_state" : "started",
      "watch_count" : 0,
      "execution_thread_pool" : {
        "queue_size" : 0,
        "max_size" : 0
      }
    },
    {
      "node_id" : "y6wGcIPQRHmhePoYtxI01Q",
      "watcher_state" : "started",
      "watch_count" : 0,
      "execution_thread_pool" : {
        "queue_size" : 0,
        "max_size" : 0
      }
    },
    {
      "node_id" : "57shI5qnS86BfgMSKaSsYg",
      "watcher_state" : "started",
      "watch_count" : 0,
      "execution_thread_pool" : {
        "queue_size" : 0,
        "max_size" : 40
      }
    },
    {
      "node_id" : "qR_AGOwJRNuBxsqPOWuRrA",
      "watcher_state" : "started",
      "watch_count" : 8,
      "execution_thread_pool" : {
        "queue_size" : 0,
        "max_size" : 40
      }
    },
    {
      "node_id" : "u9zQ0k-gS72zDbftee7OIQ",
      "watcher_state" : "started",
      "watch_count" : 0,
      "execution_thread_pool" : {
        "queue_size" : 0,
        "max_size" : 0
      }
    },
    {
      "node_id" : "wAP1CsuERLOQdECfH4y3ew",
      "watcher_state" : "started",
      "watch_count" : 0,
      "execution_thread_pool" : {
        "queue_size" : 0,
        "max_size" : 0
      }
    }
  ]
}
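
The output above is easier to read when condensed per node. The following is a minimal Python sketch, assuming only the response shape shown above (stats, node_id, watcher_state, watch_count, execution_thread_pool.queue_size); the function name and the returned keys are made up for illustration.

```python
def summarize_watcher_stats(stats_response):
    """Condense a Watcher stats response into a few per-node facts."""
    nodes = stats_response.get("stats", [])
    # Nodes whose Watcher service is in any state other than "started".
    not_started = [
        node["node_id"] for node in nodes if node.get("watcher_state") != "started"
    ]
    # Nodes that currently hold at least one watch.
    watches_per_node = {
        node["node_id"]: node["watch_count"]
        for node in nodes
        if node.get("watch_count", 0) > 0
    }
    # Nodes with executions waiting in the thread pool queue.
    queued_per_node = {
        node["node_id"]: node["execution_thread_pool"]["queue_size"]
        for node in nodes
        if node.get("execution_thread_pool", {}).get("queue_size", 0) > 0
    }
    return {
        "not_started": not_started,
        "watches_per_node": watches_per_node,
        "queued_per_node": queued_per_node,
        "total_watches": sum(watches_per_node.values()),
    }
```

Applied to the output above, this would report every node as started, watches on two nodes only, and nothing waiting in any queue.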

The watch is running on node qR_AGOwJRNuBxsqPOWuRrA, and when I checked the node stats I saw the following output:

"failures" : [
  {
    "type" : "failed_node_exception",
    "reason" : "Failed node [qR_AGOwJRNuBxsqPOWuRrA]",
    "node_id" : "qR_AGOwJRNuBxsqPOWuRrA",
    "caused_by" : {
      "type" : "translog_exception",
      "reason" : "Unable to get the earliest last modified time for the transaction log",
      "index_uuid" : "Gd5TSGq7Qoaut5hrpNocJw",
      "shard" : "0",
      "index" : ".kibana_1"
    }
  }
]

}

Could that be the reason? The .kibana_1 index has 2 shards, and 1 of them has missing translog files. How do I fix that?

Thanks.

Shubhangi · September 26, 2020, 8:00pm

I read @s1monw's response in "Cannot recover index because of missing tanslog files", but I didn't quite understand how renaming a file would help me. Also, my translog files are already named translog-11.tlog and translog.ckp.

Any help would be appreciated. Thanks!

spinscale · September 28, 2020, 8:13am

Also, that index belongs to Kibana. Watcher does not store any information in there.

What we can infer from the response you pasted is that, at the time you executed that request, there were no stuck watches. Can you share the last run via the watcher history?

GET .watcher-history-*/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "watch_id": "my_watch"
          }
        }
      ]
    }
  },
  "sort": [
    {
      "trigger_event.triggered_time": {
        "order": "desc"
      }
    }
  ]
}

You need to put in your own watch ID, though. This way we can see what was executed and whether any errors occurred. I'm still wondering how it stopped executing.

Also, did the logfiles on any of your nodes reveal anything about failed watch executions?
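
The history query above can also be built and read programmatically. This is a rough Python sketch, not an official client: the function names are hypothetical, the size parameter is an addition to limit the result to the most recent run, and the assumption that each history record keeps its action results under result.actions should be checked against a real record.

```python
import json


def build_history_query(watch_id, size=1):
    """Build the search body for the newest history records of one watch."""
    return {
        "size": size,
        "query": {"bool": {"filter": [{"term": {"watch_id": watch_id}}]}},
        "sort": [{"trigger_event.triggered_time": {"order": "desc"}}],
    }


def summarize_actions(search_response):
    """Return (triggered_time, action_id, status, error_reason) per action.

    Assumes action results live under _source.result.actions, with the
    same fields as the "actions" array pasted earlier in this thread.
    """
    rows = []
    for hit in search_response.get("hits", {}).get("hits", []):
        source = hit.get("_source", {})
        triggered = source.get("trigger_event", {}).get("triggered_time")
        for action in source.get("result", {}).get("actions", []):
            reason = action.get("error", {}).get("reason")
            rows.append((triggered, action.get("id"), action.get("status"), reason))
    return rows


if __name__ == "__main__":
    # Body to send with GET .watcher-history-*/_search
    print(json.dumps(build_history_query("my_watch"), indent=2))
```

Raising size to cover several nights would show whether every scheduled run fails or only some of them.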

system · October 26, 2020, 8:13am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
