Heartbeat configuration for 1000+ IPs

Hi gurus,

We have a requirement to monitor 1000+ IPs using Heartbeat 7.17.4.

As the baseline test, we pinged one IP address (let's call it device A) with cmd ping and the latency is around 30ms.

Then we started adding dozens of IP addresses in Heartbeat. Pinging the same device A through Heartbeat results in a latency of around 40ms, which is still acceptable.

However, when adding more and more IPs (1000+), the latency to the very same device A increases dramatically, to 400-500 ms.
So it seems that Heartbeat cannot handle that number of devices and adds artificial latency.

The configuration we use for all devices is the following:

- type: icmp
  name: CR01
  enabled: true
  schedule: '*/5 * * * * * *'
  ipv4: true
  mode: all
  timeout: 5s
  wait: 1s
  hosts: ["IP address here"]

In addition, we noticed the following message in the Heartbeat logs:

2023-09-10T09:11:04.231-0700    WARN    scheduler/scheduler.go:139    2286 tasks have missed their schedule deadlines by more than 1 second in the last 15s.

We could only find this topic related to the issue:

Is there a scheduler constraint we are hitting?
Can you please advise?

Thank you,
Catalin

If you switch the schedule to @every 5m that may help.

Heartbeat will run all monitors on startup, then on the schedule thereafter. If you have a very large number of monitors, some will be delayed as you flood the network interface, etc. However, that also shifts the next runs of those monitors forward. If you let Heartbeat run for a bit, you'll see the distribution of tasks even out. It's not ideal, but may work for you.

The problem with cron syntax is that it says to run this monitor at 0m, 5m, 10m, etc. past the hour, all at the same time, rather than giving the scheduler any flexibility.

We've discussed changing the schedule algorithm to spread monitor execution out over time, so in your case that would mean some monitors run immediately, others are offset by some amount.
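For illustration, here is the same monitor under the two syntaxes (a sketch; the name and IP are placeholders):

```yaml
# Cron-style: fires at fixed wall-clock points (every second divisible by 5),
# so every monitor with this schedule launches at the same instant.
- type: icmp
  name: CR01
  schedule: '*/5 * * * * * *'
  hosts: ["192.0.2.1"]

# @every-style: an interval relative to each monitor's own start time,
# which lets the scheduler spread the runs out over time.
- type: icmp
  name: CR01
  schedule: '@every 5s'
  hosts: ["192.0.2.1"]
```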

Hi Andrew,

Thanks for your reply.
This sounds like a bug to me. Adding just a couple dozen devices in Heartbeat adds artificial latency.
We can't schedule @every 5m, as the customer wants to ping devices every few seconds. It is crucial for them to know when a device is down as soon as possible.

Is there anything on the roadmap to fix this and to make the scheduler more flexible?

What is the recommended approach in the current situation?
We were thinking of adding more machines, even though on a single machine the network card is barely used: with 1000+ IPs pinged at the same time, only 200-300 Kb of bandwidth is used.
We would then install multiple Heartbeat instances on each machine (started at different times).
Finally, we would spread each instance's schedule out over time for batches of devices. For example, Batch A runs @every 2s, Batch B @every 3s, Batch C @every 5s, Batch D @every 7s, etc. Even that way we'll have overlaps.
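As a quick sanity check on the overlap concern (a sketch, assuming every batch starts at t = 0 and fires exactly on its interval):

```python
from math import lcm

# Proposed batch intervals in seconds.
intervals = [2, 3, 5, 7]

# With a common start time, all four batches coincide every
# lcm(2, 3, 5, 7) seconds.
print(lcm(*intervals))  # -> 210

# Seconds within one 210 s cycle where at least two batches fire together.
overlapping = sum(
    1
    for t in range(1, 211)
    if sum(t % i == 0 for i in intervals) >= 2
)
print(overlapping)  # -> 70
```

So with a shared start, roughly a third of the seconds in each cycle still see two or more batches firing at once, which suggests offsetting the instance start times matters at least as much as the choice of intervals.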

What do you reckon? Any better option here?

Thanks,
Catalin

Sorry, I think my last comment was a bit unclear. What I was saying was that the @every syntax gives Heartbeat more flexibility to schedule (and thereby scale) than the cron-style syntax. The reason I mentioned @every 5m was simply that the cron expression you shared ran on every minute divisible by 5 in the hour. In other words, prefer @every to cron-style syntax to let the scheduler rearrange things. There's a big distinction between */5 * * * * and @every 5m in terms of the scheduler being able to arrange things.

1000+ IPs isn't actually a ton, and I'm a bit surprised to hear you're running into issues here, even running them all simultaneously. Based on others who run large numbers of pings without issue, I'd be more inclined to think it was an issue with your OS / network than Heartbeat itself.

Can you give it a shot with @every 5s or whatever interval it is you desire and let me know if that worked out? If you give it a minute or two I'd expect issues to mostly stabilize.

Hi Andrew,

Below are the results of our tests. As you can see, we get the best results by adding around 10 Heartbeat instances and having the IPs divided into batches, each batch with a different schedule: @every 2s, 3s, 5s, 7s.

Notice the latency based on your recommendation on the right: 1 Heartbeat instance, 1 schedule @every 5s.

Thanks,
Catalin
