Watcher on Elastic Cloud - failed to run triggered watch due to thread pool capacity

jakehschwartz · February 10, 2017, 10:43pm

After adding watches to our cluster, we noticed that they were mostly failing with the message:
"failed to run triggered watch [ID] due to thread pool capacity"

I don't think we can increase the thread pool capacity, but it's possible I was doing it wrong.

Here are a bunch of stats. Let me know if you need more.

Cluster Id: d1b954
Version 1.7.5

GET /_watcher/stats?all&pretty&emitstacktraces
{
  "watcher_state": "started",
  "watch_count": 2994,
  "execution_thread_pool": {
    "queue_size": 0,
    "max_size": 20
  }
}

GET /_cat/thread_pool?v
host         ip          bulk.active bulk.queue bulk.rejected index.active index.queue index.rejected search.active search.queue search.rejected
7f1d632f41c9 172.17.0.26           0          0             0            0           0              0             0            0               0
7dfb366da261 172.17.0.12           0          0          4299            0           0            501             0            0               0
ded7c87c27a9 172.17.0.18           1          0             0            0           0          30284             0            0               0

warkolm · February 10, 2017, 11:27pm

Shifting this to Watcher, as it's not a Cloud-specific question.

spinscale · February 11, 2017, 5:22pm

Hey,

Is it possible that you are starting a lot of watches at the same time? By default, Watcher uses a thread pool size of 5x the number of cores plus a queue size of 1000. To get that exception, you need to exceed those numbers.

--Alex

jakehschwartz · February 12, 2017, 12:51am

We did spread them out when we loaded them up, but they run every 2 minutes.

spinscale · February 12, 2017, 11:18am

Hey,

Do you have an average runtime for a watch? That would allow us to calculate whether this works out or not. Long-running watches block execution, and watches then pile up in the queue.

Doing the rough numbers here (around 3k watches, each every two minutes), this means you run 25 watches per second, which I would assume is a lot more than the number of watches that can run in parallel. With a single core (five threads by default), a single watch would need to take at most 0.2 seconds to execute in order not to create a backlog; with two cores, 0.4 seconds; and so forth.
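
The rough numbers above can be sketched in a few lines of Python. This is only a back-of-the-envelope model: it assumes the default of five execution threads per core mentioned earlier in the thread, treats watches as evenly spread over the interval, and ignores the queue. The function names are made up for illustration.

```python
def required_rate(watch_count: int, interval_seconds: float) -> float:
    """Average number of watch executions triggered per second."""
    return watch_count / interval_seconds


def max_duration_per_watch(rate_per_second: float, cores: int,
                           threads_per_core: int = 5) -> float:
    """Longest average runtime (in seconds) a watch may have before a backlog builds.

    threads_per_core=5 is the default quoted in this thread, not a measured value.
    """
    threads = cores * threads_per_core
    return threads / rate_per_second


rate = required_rate(3000, 120)  # ~3k watches, each every two minutes
print(f"{rate:.0f} watches per second")
for cores in (1, 2, 4):
    budget = max_duration_per_watch(rate, cores)
    print(f"{cores} core(s): each watch must finish within {budget:.1f}s on average")
```

Plugging in the real watch count and the core count of the node gives a quick sanity check on whether the schedule can keep up at all.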

--Alex

jakehschwartz · February 13, 2017, 5:08pm

result.execution_duration for the ones that do run seems to range from 200 to 400 on average.

I don't know how many cores / threads I have because this is on Elastic Cloud. Is there a way to find that out with a query?

jakehschwartz · February 13, 2017, 7:25pm

@spinscale How do you suggest spacing them out? We ran the PUT once every second to at least get an even distribution. Our interval is as follows:

  "trigger" : {
    "schedule" : {
      "interval" : "2m"
    }
  }

We assumed that this meant 2 minutes from when the watch was added. Is this actually a fixed interval, where all alerts set for the 2m interval trigger at the same time?

spinscale · February 15, 2017, 10:39am

Hey,

You can use the cat API to find out:

GET _cat/thread_pool?h=name,mi,ma,s,q&v

When you restart your node, the 2m is no longer calculated from the last run, but from the time the watches are loaded (so all of them have the same start time). You can use a cron schedule to specify the trigger time down to the exact second.
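
One way to apply the cron suggestion is to give each watch its own second offset within the two-minute window, so the start times stay spread out even after a restart. The Python sketch below generates such trigger bodies. It assumes the cron schedule takes a Quartz-style expression with a leading seconds field (which matches "down to the exact second"), and the helper names `staggered_cron` and `trigger_for` are hypothetical.

```python
def staggered_cron(index: int, interval_minutes: int = 2) -> str:
    """Build a cron expression that fires every `interval_minutes`,
    shifted into one of the 1-second slots of that interval.

    Assumes Quartz-style fields: second minute hour day-of-month month day-of-week.
    Spacing is only even if interval_minutes divides 60.
    """
    slots = interval_minutes * 60          # e.g. 120 one-second slots for 2m
    slot = index % slots                   # wrap around when there are more watches than slots
    second = slot % 60
    minute_offset = slot // 60
    return f"{second} {minute_offset}/{interval_minutes} * * * ?"


def trigger_for(index: int) -> dict:
    """Trigger section of a watch body using the staggered cron schedule."""
    return {"trigger": {"schedule": {"cron": staggered_cron(index)}}}


for i in (0, 1, 61, 120):
    print(i, trigger_for(i))
```

With roughly 3,000 watches and 120 slots, this still puts about 25 watches in every second, so it evens out the load rather than reducing it.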

--Alex

system · February 18, 2017, 10:40am

This topic was automatically closed 3 days after the last reply. New replies are no longer allowed.
