Heartbeat configuration for 1000+ IPs

catalin.bulancea · September 12, 2023, 10:31am

Hi gurus,

We have a requirement to monitor 1000+ IPs using Heartbeat 7.17.4.

As the baseline test, we pinged one IP address (let's call it device A) with cmd ping and the latency is around 30ms.

Then we started adding dozens of IP addresses in Heartbeat. Pinging the same device A through Heartbeat results in a latency of around 40ms, which is still acceptable.

However, when adding more and more IPs (1000+), the latency of the very same device A increases dramatically, to 400-500ms.
So it seems clear that Heartbeat cannot handle that number of devices and adds artificial latency.

The configuration we use for all devices is the following:

- type: icmp
  name: CR01
  enabled: true
  schedule: '*/5 * * * * * *'
  ipv4: true
  mode: all
  timeout: 5s
  wait: 1s
  hosts: ["IP address here"]

In addition, we noticed the following message in the Heartbeat logs:

2023-09-10T09:11:04.231-0700    WARN    scheduler/scheduler.go:139    2286 tasks have missed their schedule deadlines by more than 1 second in the last 15s.

We could only find this topic related to the issue:

github.com/elastic/beats

[Heartbeat] Scheduler deadline exceeded message is confusing

opened 09:42AM - 17 Dec 20 UTC

andrewvc

bug Team:Uptime

The warning for tasks that miss their deadline is confusing in https://github.co…m/elastic/beats/blob/master/heartbeat/scheduler/scheduler.go#L152 . It currently reads: `"%d tasks have missed their schedule deadlines in the last %s."`

It's really unclear to users what's going on here (to the point I'm labeling this a bug), we should make it more friendly, something like: `%d tasks are running behind schedule (previous run not finished when next one is already due to run).`

We also should document troubleshooting this somewhere. There are a number of different potential causes, and it's too much to put in an error message.

1. Constrained scheduler limits for too many monitors (if you have 1000 monitors that each take a second to execute on a 30s interval, and a schedule that constrains us to execute at most 2 at a time, after 60s only 120 monitors will have run, causing the rest to be behind).
2. Heartbeat is actually resource constrained (same as above, but we're just hitting real limits, not artificial ones).
3. A timeout value exceeding the schedule interval (if you check a resource with the default timeout of 16s every 5s, and it takes 10s to run we'll miss a deadline since we don't overlap checks of the same monitor).

In the case of the last point, I'm also wondering if we should just suppress the message, because it's very likely you'll hit this state, but it won't actually be an error.

Additionally, we should consider listing the specific monitors that are in this state to help users debug this issue.

Is there a scheduler constraint we are hitting?
Can you please advise?

Thank you,
Catalin

Andrew_Cholakian1 · September 12, 2023, 2:22pm

If you switch the schedule to @every 5m that may help.

Heartbeat will run all monitors on startup, then on the schedule thereafter. If you have a very large number of monitors some will be delayed as you flood the network interface etc. However, that also shifts the next runs of those monitors forward. If you let Heartbeat run for a bit you'll see the distribution of tasks even out. It's not ideal, but may work for you.

The problem with cron syntax is it says run this monitor at 0m, 5m, 10m etc. past the hour, all at the same time, rather than being more flexible.

We've discussed changing the schedule algorithm to spread monitor execution out over time, so in your case that would mean some monitors run immediately, others are offset by some amount.

catalin.bulancea · September 13, 2023, 2:33pm

Hi Andrew,

Thanks for your reply.
This sounds like a bug to me. Adding just a couple of dozen devices in Heartbeat adds artificial latency.
We can't schedule @every 5m, as the customer wants to ping devices every few seconds. It is crucial for them to know when a device is down as soon as possible.

Is there anything on the roadmap to fix this and to make the scheduler more flexible?

What is the recommended approach in the current situation?
We were thinking of using more machines, even though the network card on a single machine is barely used. With 1000+ IPs pinged at the same time, only 200-300 Kb of bandwidth is used.
Then install several Heartbeat instances on each machine (started at different times).
Finally, spread the schedule of each instance out over time for batches of devices. For example, have Batch A run @every 2s, Batch B run @every 3s, Batch C run @every 5s, Batch D run @every 7s, etc. Even that way we'll have overlaps.
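A minimal sketch of what the batching idea could look like in a Heartbeat config. The monitor ids, names, and host addresses are hypothetical placeholders (the 192.0.2.x addresses come from the documentation range), and two batches are shown in one file only for brevity; in the setup described above, each batch would live in its own Heartbeat instance.

```yaml
heartbeat.monitors:
# Batch A: hypothetical group of devices pinged every 2 seconds
- type: icmp
  id: batch-a          # hypothetical id
  name: Batch A        # hypothetical name
  schedule: '@every 2s'
  ipv4: true
  mode: all
  timeout: 5s
  wait: 1s
  hosts: ["192.0.2.10", "192.0.2.11"]   # placeholder addresses

# Batch B: hypothetical group of devices pinged every 3 seconds
- type: icmp
  id: batch-b          # hypothetical id
  name: Batch B        # hypothetical name
  schedule: '@every 3s'
  ipv4: true
  mode: all
  timeout: 5s
  wait: 1s
  hosts: ["192.0.2.20", "192.0.2.21"]   # placeholder addresses
```

Different intervals still coincide at their common multiples (a 2s batch and a 3s batch both fire every 6s), so this reduces simultaneous pings rather than eliminating them.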

What do you reckon? Any better option here?

Thanks,
Catalin

Andrew_Cholakian1 · September 14, 2023, 3:32am

Sorry, I think my last comment was a bit unclear. What I was saying was that the @every syntax gives Heartbeat more flexibility to schedule (and thereby scale) than the cron style syntax. The reason I mentioned @every 5m was simply that the cron expression you shared was to run on every minute divisible by 5 in the hour. In other words, prefer @every to cron style syntax to let the scheduler rearrange things. There's a big distinction between */5 * * * * and @every 5m in terms of the scheduler being able to arrange things.

1000+ IPs isn't actually a ton and I'm a bit surprised to hear you're running into issues here, even running them all simultaneously. I'd be more inclined to think it was an issue with your OS / network than heartbeat itself based on others who run large numbers of pings without issue.

Can you give it a shot with @every 5s or whatever interval it is you desire and let me know if that worked out? If you give it a minute or two I'd expect issues to mostly stabilize.
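For reference, a sketch of the monitor from the original post with only the schedule changed to the @every form. All other values are copied from the configuration shared earlier in the thread; the hosts entry remains a placeholder.

```yaml
heartbeat.monitors:
- type: icmp
  name: CR01
  enabled: true
  # Interval-based schedule instead of the cron expression, so runs are
  # timed relative to each monitor's previous run rather than pinned to
  # fixed clock ticks shared by every monitor.
  schedule: '@every 5s'
  ipv4: true
  mode: all
  timeout: 5s
  wait: 1s
  hosts: ["IP address here"]   # placeholder, as in the original post
```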

catalin.bulancea · September 27, 2023, 6:41pm

Hi Andrew,

Below is the result of our tests. As you can see, we get the best results by running around 10 Heartbeat instances and dividing the IPs into batches, each batch with a different schedule: @every 2s, 3s, 5s, 7s.

Notice the latency based on your recommendation at the right: 1 heartbeat instance, 1 schedule @every 5s.

Thanks,
Catalin

system · October 25, 2023, 8:41pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
