I'm using Heartbeat version 7.4.2. I've got 40+ HTTP monitors configured, all of which use cron syntax for scheduling. I'm using '0 0 */2 * * * *', meaning every 2 hours. I also have scheduler.limit set to 10, so at most 10 tasks run at a time. When the time rolls around to run the monitors, not all of them run. In fact, the ones that were skipped don't run until 2 hours later (sometimes not even then). Some monitors go 6+ hours without running.
Am I misunderstanding how the scheduler limit should work? Or is this a bug? Thoughts and suggestions please.
That cron config '0 0 */2 * * * *' means every 2 days at midnight. I think you want '0 */2 * * * * *', which means the first minute every 2 hours. I'd recommend using the syntax '@every 2h' instead, since it's much more straightforward to read.
EDIT: My mistake, I was reading it wrong, you are right, that is every 2 hours. I'm going to try and repro this.
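For anyone else who trips over the field order, here is a minimal Python sketch that labels the fields of the two expressions discussed above. It assumes the 7-field layout (seconds, minutes, hours, day of month, month, day of week, year) used by the cronexpr-style syntax that Heartbeat schedules are based on. The helper name label_cron_fields is hypothetical, and it only splits and names the fields; it does not validate or evaluate the schedule.

```python
# Assumed field order for a 7-field cron expression (cronexpr-style):
# second, minute, hour, day of month, month, day of week, year.
FIELD_NAMES = (
    "second",
    "minute",
    "hour",
    "day_of_month",
    "month",
    "day_of_week",
    "year",
)


def label_cron_fields(expr: str) -> dict:
    """Map each whitespace-separated field of a 7-field expression to its name.

    Hypothetical helper for illustration only; it does not check that the
    individual field values are valid cron syntax.
    """
    parts = expr.split()
    if len(parts) != len(FIELD_NAMES):
        raise ValueError(
            f"expected {len(FIELD_NAMES)} fields, got {len(parts)}: {expr!r}"
        )
    return dict(zip(FIELD_NAMES, parts))


if __name__ == "__main__":
    for expr in ("0 0 */2 * * * *", "0 */2 * * * * *"):
        fields = label_cron_fields(expr)
        # Show which field carries the */2 step in each expression.
        stepped = [name for name, value in fields.items() if value == "*/2"]
        print(expr, "-> step of 2 applies to:", stepped)
```

Under that assumed layout, the step in the original expression lands in the hour field, which matches the "every 2 hours" reading, while the step in the suggested alternative lands in the minute field.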
That said, I'd be interested to hear what use case you have for running Heartbeat so infrequently. I've been mulling whether we should put a cap on the maximum amount of time between checks (there's some potential to improve perf in ES queries if we can depend on that).
Generally people don't run checks more than, say, 5 minutes apart. The main reason is that if a check fails for a transient reason, you want to see how long it took to recover. With a two-hour check, it takes two full hours before the next check can tell you whether it recovered.
Please don't cap the maximum amount of time between checks. At least allow an interval of a couple of hours. We do have good reason for running our checks infrequently. We have a fairly unique use case, and I hope Heartbeat is the solution that will solve all our maintenance woes. A cap of less than an hour would make Heartbeat useless to us, so please don't cap it. I'll message you our specific use case.