After adding watches to our cluster, we noticed that they were mostly failing with the message:
"failed to run triggered watch [ID] due to thread pool capacity"
I don't think we can increase the thread pool capacity, but it's possible I was doing it wrong.
Here are a bunch of stats. Let me know if you need more.
Is it possible that you are starting a lot of watches at the same time? By default, Watcher uses a thread pool size of 5x the number of cores plus a queue size of 1000. To get that exception, you need to be above those numbers.
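You can verify this per node: each node reports the watcher pool's live numbers, and the `rejected` counter increments every time a triggered watch is dropped. A quick sketch (response abbreviated, values illustrative):

```
GET _nodes/stats/thread_pool

# relevant part of the response, per node:
# "watcher": { "threads": 10, "queue": 850, "active": 10, "rejected": 4096, ... }
```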
Do you have an average runtime of a watch? That would allow us to calculate whether this works out or not. If you have long-running watches, they block execution and watches pile up in the queue.
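One way to get that number is to aggregate over the watch history. A sketch, assuming the default `.watcher-history-*` indices and the `result.execution_duration` field (runtime in milliseconds; verify the field name against your version's history mapping):

```
POST .watcher-history-*/_search
{
  "size": 0,
  "aggs": {
    "avg_runtime_ms": {
      "avg": { "field": "result.execution_duration" }
    }
  }
}
```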
Doing the rough numbers here (around 3k watches, each every two minutes), this means you run 25 watches per second, which I would assume is a lot more than the number of watches that can run in parallel. With a single core, a single watch would need to take at most 0.2 seconds to execute in order not to create a backlog, with two cores 0.4 seconds, and so forth.
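A minimal sketch of that back-of-the-envelope calculation; `cores` is an assumption, set it to your node's actual core count:

```python
watches = 3000          # total number of watches
interval_s = 120        # each watch fires every 2 minutes
cores = 2               # ASSUMPTION: adjust to your node

rate = watches / interval_s       # ~25 executions per second
threads = 5 * cores               # default watcher thread pool size
max_avg_runtime = threads / rate  # seconds each watch may take on average

print(f"{rate:.0f} executions/s -> each watch must finish in <= {max_avg_runtime:.2f} s")
```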
We assumed that this meant 2 minutes from when the watch was added. Is this actually a fixed interval, where it will trigger all alerts set to the 2m interval at the same time?
When you restart your node, the 2m is no longer calculated from the last run, but from the time the watches are loaded (so all have the same start time). You can use a cron schedule to specify the trigger time down to the exact second.
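A minimal trigger sketch using Watcher's cron syntax (Quartz-style fields: seconds, minutes, hours, day of month, month, day of week). Varying the seconds field per watch lets you stagger the start times; this one fires at second 15 of every even minute:

```
"trigger": {
  "schedule": {
    "cron": "15 0/2 * * * ?"
  }
}
```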