Hello,
I was wondering if I could ask for some advice, please. We have a use case for Heartbeat: monitoring ICMP ping reachability for around 25,000 network devices.
Does anyone have any insight or best practices on how to make Heartbeat scale to those kinds of numbers?
The short answer is yes, Heartbeat can do that job. The other question is how.
So, let me ask for some more detail:
Are these devices all on the same network, or on multiple separate networks?
How often do you need to check them? Checking every second vs. every minute requires 60x the resources!
What are your criteria for them being up? Keep in mind with ICMP you may not want alerts to fire unless multiple pings are down since ICMP is lossy.
In general, when it comes to performance, the best answer we can give is how to tune the software, since all hardware and networks are different.
Key settings when setting up Heartbeat for large numbers of hosts are:
Timeout: This number should be less than the schedule interval. Keep in mind that Heartbeat has to keep resources open for the entire duration of the timeout, so a shorter timeout may free up resources sooner.
heartbeat.scheduler.limit: This constrains the number of checks that can execute simultaneously. Too many simultaneous checks can overwhelm your network interface, so having them go out at a more measured rate can help.
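As a rough illustration of how those two settings might fit together, here is a minimal Python sketch that renders a heartbeat.yml fragment with one ICMP monitor per device group. The helper name render_heartbeat_config, the group id, the host addresses, the 5-second timeout and the limit of 100 are all made-up placeholders rather than recommendations; only the monitor keys (type, id, schedule, timeout, hosts) and heartbeat.scheduler.limit are actual Heartbeat settings.

```python
import json


def render_heartbeat_config(groups, scheduler_limit):
    """Render a heartbeat.yml fragment with one ICMP monitor per device group."""
    lines = ["heartbeat.monitors:"]
    for group in groups:
        # Advice from this thread: keep the timeout below the schedule interval.
        if group["timeout_s"] >= group["interval_s"]:
            raise ValueError(f"{group['id']}: timeout must be shorter than the interval")
        lines += [
            "- type: icmp",
            f"  id: {group['id']}",
            f"  schedule: '@every {group['interval_s']}s'",
            f"  timeout: {group['timeout_s']}s",
            # A JSON array is also a valid YAML flow sequence.
            f"  hosts: {json.dumps(group['hosts'])}",
        ]
    # Cap on how many checks may run at the same time (placeholder value).
    lines += ["", f"heartbeat.scheduler.limit: {scheduler_limit}"]
    return "\n".join(lines) + "\n"


# Hypothetical device group; the addresses come from a documentation range.
groups = [
    {"id": "site-a-ping", "interval_s": 300, "timeout_s": 5,
     "hosts": ["192.0.2.10", "192.0.2.11"]},
]
print(render_heartbeat_config(groups, scheduler_limit=100))
```

Generating the file from an inventory, rather than editing it by hand, is just one way to keep 25,000 hosts manageable. A sensible limit value depends on your hardware and network, so treat the number here as something to tune.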
Are these devices all on the same network, or on multiple separate networks?
They are spread across many separate networks.
How often do you need to check them? Checking every second vs. every minute requires 60x the resources!
Indeed! At the moment there may be some requirements to ping some groups of devices more frequently than others, but the current thinking is to ping every 5 minutes initially and then gradually increase the frequency to once per minute, keeping the one-minute data for only a short retention period and rolling it up.
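For a sense of scale, here is a back-of-the-envelope Python sketch of what those intervals mean for 25,000 devices. It assumes checks are spread evenly across the interval and uses a hypothetical 5-second timeout; the function names are made up for illustration and real behaviour will depend on how Heartbeat schedules the monitors.

```python
def checks_per_second(devices, interval_s):
    """Average check rate if checks are spread evenly over the interval."""
    return devices / interval_s


def worst_case_in_flight(devices, interval_s, timeout_s):
    """Checks open at once if every check runs until its timeout.

    Little's law: in-flight = arrival rate * time each check stays open.
    """
    return checks_per_second(devices, interval_s) * timeout_s


for interval_s in (300, 60):
    rate = checks_per_second(25_000, interval_s)
    in_flight = worst_case_in_flight(25_000, interval_s, timeout_s=5)
    print(f"every {interval_s}s: {rate:.1f} checks/s, "
          f"up to {in_flight:.0f} open if all time out")
```

Moving from 5 minutes to 1 minute multiplies both figures by five, which is why the timeout and the concurrency limit matter more as the frequency goes up.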
What are your criteria for them being up? Keep in mind with ICMP you may not want alerts to fire unless multiple pings are down since ICMP is lossy.
We are having our own interesting debate on this topic, because responding to ICMP does not necessarily mean the device is up and operating correctly. All we can really say is that ICMP confirms the device is reachable. In the fullness of time it would be good to have alerting.
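On the point about not alerting on a single lost ping, one simple rule is to treat a device as unreachable only after several consecutive failures. The Python sketch below illustrates that idea on its own; it is not a Heartbeat feature, and the class name ReachabilityTracker and the threshold of 3 are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class ReachabilityTracker:
    failures_to_alert: int = 3      # hypothetical threshold
    consecutive_failures: int = 0

    def record(self, ping_ok: bool) -> bool:
        """Record one ping result; return True once the device counts as unreachable."""
        if ping_ok:
            # Any successful ping resets the streak.
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.failures_to_alert


tracker = ReachabilityTracker(failures_to_alert=3)
results = [True, False, True, False, False, False]
alerts = [tracker.record(ok) for ok in results]
print(alerts)  # [False, False, False, False, False, True]
```

The isolated failure in the second position never triggers anything; only the run of three at the end does. With a 5-minute schedule, a threshold of 3 would mean roughly 15 minutes before an alert, so the threshold and the interval need to be chosen together.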
Many thanks for the insight on the timeout and heartbeat.scheduler.limit parameters - we had not considered those in our design.
Now that I know a little more, one other thing you'll want to consider in your design is what it means for something to be reachable. It is typical to have one Heartbeat instance inside a network monitoring whether the device is up, and one or more outside the network checking whether it is externally reachable.
Let us know if you have any further questions. Additionally, it'd be great to hear from you guys about what architecture you land on. It sounds like an interesting use case, and we'd love to learn from it.