Watcher on Elastic Cloud - failed to run triggered watch due to thread pool capacity

After adding watches to our cluster, we noticed that they were mostly failing with the message:
"failed to run triggered watch [ID] due to thread pool capacity"

I don't think we can increase the thread pool capacity, but it's possible I was doing it wrong.

Here are a bunch of stats. Let me know if you need more.

Cluster Id: d1b954
Version 1.7.5

GET /_watcher/stats?all&pretty&emitstacktraces
{
  "watcher_state": "started",
  "watch_count": 2994,
  "execution_thread_pool": {
    "queue_size": 0,
    "max_size": 20
  }
}

GET /_cat/thread_pool?v
host         ip          bulk.active bulk.queue bulk.rejected index.active index.queue index.rejected search.active search.queue search.rejected
7f1d632f41c9 172.17.0.26           0          0             0            0           0              0             0            0               0
7dfb366da261 172.17.0.12           0          0          4299            0           0            501             0            0               0
ded7c87c27a9 172.17.0.18           1          0             0            0           0          30284             0            0               0

Shifting this to Watcher as it's not a specific Cloud question :slight_smile:

Hey,

is it possible that you are starting a lot of watches at the same time? By default, Watcher uses a thread pool size of 5x the number of cores plus a queue size of 1000. To get that exception, you need to be above those numbers.

--Alex

We did spread them out when we loaded them, but they run every 2 minutes.

Hey,

do you have an average runtime of a watch? That would let us calculate whether this works out or not. Long-running watches block execution, and watches then pile up in the queue.

Doing the rough numbers here (around 3k watches, each running every two minutes), this means you run about 25 watches per second, which I would assume is a lot more than the number of watches that can run in parallel. With a single core (five threads at the default of 5x the number of cores), a single watch would have to finish in 0.2 seconds in order not to create a backlog, with two cores in 0.4 seconds, and so forth.

--Alex
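
The rough numbers above can be reproduced with a small sketch. It assumes the default pool size of 5 threads per core mentioned earlier in the thread and ignores the queue, so it only estimates the steady-state budget per watch. The function name and parameters are made up for illustration.

```python
def max_watch_runtime_seconds(watch_count, interval_seconds, cores, threads_per_core=5):
    """Longest average runtime per watch that does not build up a backlog.

    Assumes the watcher thread pool holds cores * threads_per_core threads
    and that watches arrive evenly spread over the interval.
    """
    watches_per_second = watch_count / interval_seconds
    threads = cores * threads_per_core
    # Each thread has to finish watches_per_second / threads watches every second.
    return threads / watches_per_second


if __name__ == "__main__":
    # 2994 watches (from the stats output), each on a 2 minute interval.
    for cores in (1, 2, 4):
        budget = max_watch_runtime_seconds(2994, 120, cores)
        print(f"{cores} core(s): about {budget:.2f} s per watch")
```

If the measured average runtime is above the printed budget for the actual core count, the queue will keep growing until executions are rejected.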

result.execution_duration for the ones that do run seems to range from 200 to 400 on average.

I don't know how many cores / threads I have because this is on ElasticCloud. Is there a way to find that out with a query?

@spinscale How do you suggest spacing them out? We ran the PUT once every second to at least get an even distribution. Our interval is as follows:

  "trigger" : {
    "schedule" : {
      "interval" : "2m"
    }
  }

We assumed that this meant 2 minutes from when the watch was added. Or is this actually a fixed interval, so that all watches set to the 2m interval trigger at the same time?

Hey,

you can use the cat API to find out

GET _cat/thread_pool?h=name,mi,ma,s,q&v

When you restart your node, the 2m interval is no longer calculated from the last run, but from the time the watches are loaded (so they all have the same start time). You can use a cron schedule to specify the trigger time down to the exact second.

--Alex
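
As a sketch of the cron suggestion: the helper below gives each watch its own second and minute offset, so that watches on a 2 minute cycle stay spread over the full 120 seconds even after a restart. It assumes Watcher's cron schedule takes the seconds field first (`<seconds> <minutes> <hours> <day_of_month> <month> <day_of_week>`); check that against the documentation for your version. `staggered_cron` and the index-based offset are hypothetical names chosen for illustration.

```python
def staggered_cron(index, interval_minutes=2):
    """Return a cron expression that fires every interval_minutes,
    shifted by an offset derived from the watch's index."""
    if 60 % interval_minutes != 0:
        # The "start/step" minute field only repeats evenly if the step divides 60.
        raise ValueError("interval_minutes must divide 60")
    period_seconds = interval_minutes * 60
    offset = index % period_seconds
    second = offset % 60          # second within the minute
    start_minute = offset // 60   # first minute of the cycle to fire in
    return f"{second} {start_minute}/{interval_minutes} * * * ?"


def trigger_for(index):
    """Build the trigger section of a watch body using the staggered schedule."""
    return {"trigger": {"schedule": {"cron": staggered_cron(index)}}}


if __name__ == "__main__":
    for i in (0, 1, 61, 119):
        print(i, staggered_cron(i))
```

With roughly 3k watches and 120 available slots, about 25 watches still share each second, so this only evens out the load; it does not reduce it.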
