Heartbeat scalability

dorry · January 4, 2019, 8:43am

Hello,
I was wondering if I could get some advice, please. We have a use case for Heartbeat: monitoring ICMP ping reachability of around 25,000 network devices.

Does anyone have any insight or best practices on how to make Heartbeat scale to those kinds of numbers, please?

Thanks in advance,
David

Andrew_Cholakian1 · January 4, 2019, 9:55pm

Hi David, great question.

So the short answer is yes, Heartbeat can do that job. The other question is how.

So, let me ask for some more detail:

Are these devices all on the same network, or on multiple separate networks?
How often do you need to check them? Checking every second vs. every minute requires 60x the resources!
What are your criteria for them being up? Keep in mind with ICMP you may not want alerts to fire unless multiple pings are down since ICMP is lossy.
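
To put rough numbers on the check-frequency question, here is a minimal sketch of the average check rate for the 25,000 devices mentioned in the original post. The intervals are illustrative, and the calculation only covers how many checks must start per second, not CPU or bandwidth.

```python
# Rough steady-state load for pinging a fleet at different intervals.
# DEVICES comes from the original question; the intervals are illustrative.
DEVICES = 25_000


def checks_per_second(devices: int, interval_seconds: float) -> float:
    """Average number of checks that must start each second."""
    return devices / interval_seconds


for label, interval in [("every 5 min", 300), ("every 1 min", 60), ("every 1 s", 1)]:
    rate = checks_per_second(DEVICES, interval)
    print(f"{label:>12}: {rate:,.1f} checks/s")
```

Going from a one-minute to a one-second interval multiplies the rate by 60, which is where the "60x the resources" figure comes from.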

In general, when it comes to performance, the best answer we can give is how to tune the software. Hardware and networks all differ, so the right values will depend on your environment.

Key settings when setting up Heartbeat for large numbers of hosts are:

timeout: This number should be less than the schedule interval. Keep in mind that Heartbeat has to keep resources open for the entire duration of the timeout. A shorter timeout may free up resources.
heartbeat.scheduler.limit: This constrains the number of checks that can execute simultaneously. Too many checks executing at once can overwhelm your network interface. Having them go out at a more measured rate can help.
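
To get a feel for how the timeout and the scheduler limit interact, here is a back-of-the-envelope sketch using Little's law (average in-flight checks = check rate × time each check stays open). It is a rough model, not a description of Heartbeat's internal scheduler, and the 5 s timeout and 50 ms typical reply time are assumed values.

```python
import math


def in_flight(devices: int, interval_s: float, hold_s: float) -> float:
    """Little's law: average concurrent checks = start rate * time each check is held open."""
    return devices / interval_s * hold_s


DEVICES, INTERVAL_S = 25_000, 60
TIMEOUT_S = 5.0        # assumed timeout; should stay below the schedule interval
TYPICAL_RTT_S = 0.05   # assumed reply time for a healthy device

if TIMEOUT_S >= INTERVAL_S:
    raise ValueError("timeout should be less than the schedule interval")

typical = in_flight(DEVICES, INTERVAL_S, TYPICAL_RTT_S)
worst = in_flight(DEVICES, INTERVAL_S, TIMEOUT_S)
print(f"typical in-flight checks:  {math.ceil(typical)}")
print(f"worst case (all time out): {math.ceil(worst)}")
```

With these assumptions the typical case needs about 21 concurrent checks, while an outage in which every ping times out needs about 2,084. A shorter timeout shrinks that worst case, and a scheduler limit below it means checks wait their turn rather than all going out at once.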

dorry · January 4, 2019, 10:23pm

Hi Andrew,

Many thanks for this insight - much appreciated.

On your questions ...

Are these devices all on the same network, or on multiple separate networks?
They are spread across many separate networks.
How often do you need to check them? Checking every second vs. every minute requires 60x the resources!
Indeed! At the moment there may be requirements to ping some groups of devices more frequently than others, but the current thinking is to ping every 5 minutes initially and then gradually increase the frequency to once per minute, keeping the one-minute data for a short retention period and rolling it up.
What are your criteria for them being up? Keep in mind with ICMP you may not want alerts to fire unless multiple pings are down since ICMP is lossy.

We are having our own interesting debate on this topic, because responding to ICMP does not necessarily mean the device is up and operating correctly. All we can really say is that ICMP confirms the device is reachable. In the fullness of time it would be good to have alerting.
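
For reference, the "multiple pings down" criterion quoted above could be sketched as a consecutive-failure counter. This is hypothetical downstream logic applied to stored ping results, not a Heartbeat feature, and the class name and threshold of 3 are assumptions chosen for illustration.

```python
from collections import defaultdict


class ConsecutiveFailureAlert:
    """Flag a host as unreachable only after N failed pings in a row."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self._failures = defaultdict(int)  # host -> current run of failures

    def observe(self, host: str, ping_ok: bool) -> bool:
        """Record one ping result; return True if the host should alert."""
        if ping_ok:
            self._failures[host] = 0  # any success resets the run
            return False
        self._failures[host] += 1
        return self._failures[host] >= self.threshold


alert = ConsecutiveFailureAlert(threshold=3)
# One lost ping, a recovery, then three losses in a row.
for ok in [False, True, False, False, False]:
    print(alert.observe("10.0.0.1", ok))
```

Only the final result triggers an alert, so a single dropped packet on a lossy path does not page anyone.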

Many thanks for the insight on the timeout and heartbeat.scheduler.limit parameters - we had not considered those in our design.

Thanks,
David

Andrew_Cholakian1 · January 4, 2019, 11:07pm

Now that I know a little more, one other thing you'll want to consider in your design is what it means for something to be reachable. It is typical to have one Heartbeat instance inside a network monitoring whether the device is up, and one or more outside the network checking whether it is externally reachable.
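
As a rough illustration of the one-instance-per-network idea, the sketch below groups a device inventory by network and prints a small ICMP monitor config for each group. The site names, CIDR ranges, device addresses, and the schedule, timeout, and limit values are all made up for the example; only the config keys (heartbeat.monitors, type, schedule, timeout, hosts, heartbeat.scheduler limit) are real Heartbeat settings.

```python
import ipaddress
from collections import defaultdict

# Hypothetical sites and inventory, for illustration only.
SITES = {"site-a": "10.1.0.0/16", "site-b": "10.2.0.0/16"}
DEVICES = ["10.1.0.5", "10.2.3.4", "10.1.9.9"]


def group_by_site(devices, sites):
    """Map each device IP to the first site whose network contains it."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in sites.items()}
    groups = defaultdict(list)
    for ip in devices:
        addr = ipaddress.ip_address(ip)
        site = next((n for n, net in nets.items() if addr in net), "unassigned")
        groups[site].append(ip)
    return dict(groups)


def render_config(hosts, schedule="@every 5m", timeout="5s", limit=100):
    """Render a minimal heartbeat.yml body for one instance (values are assumptions)."""
    host_list = ", ".join(f'"{h}"' for h in hosts)
    return (
        "heartbeat.monitors:\n"
        "- type: icmp\n"
        f"  schedule: '{schedule}'\n"
        f"  timeout: {timeout}\n"
        f"  hosts: [{host_list}]\n"
        "heartbeat.scheduler:\n"
        f"  limit: {limit}\n"
    )


for site, hosts in group_by_site(DEVICES, SITES).items():
    print(f"# heartbeat.yml for the instance inside {site}")
    print(render_config(hosts))
```

An instance outside the networks could be given the same host lists to cover the external-reachability side.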

Let us know if you have any further questions. Additionally, it'd be great to hear from you guys about what architecture you land on. It sounds like an interesting use case, and we'd love to learn from it.

