Pingbeat: Recommended architectures?


(Viztastic) #1

Usecase

My use case involves 200,000+ network assets and Pingbeat seems like a good tool to monitor their availability and round trip time.

Constraint

The issue is that our client expects continuous monitoring as there are OH&S implications of not maintaining continuous visibility on equipment.

Question

A simple approach would be to dump the list of IPs as 'targets', but I have a few questions:

  1. What would the performance of this look like? Is there a recommended threshold for how many targets Pingbeat could handle?
  2. Is there a way to dynamically add items to or remove items from the list without having to shut down and restart Pingbeat?

Another approach would be to install an agent on each of the assets and get data that way, but Pingbeat allows us to capture data on newly connected devices before they have additional agents installed (e.g. Filebeat).

Help/advice appreciated!


(Joshua Rich) #2

Hi @viztastic,

I'm the author of Pingbeat. This sounds like a perfect use case for the tool. I intended it to be deployed anywhere you need latency measurements taken, potentially duplicating targets to measure latency across different links.

In answer to your questions:

  1. I've tried to design Pingbeat to scale, but I can't really test at large scales effectively. Currently Pingbeat has a fair amount of concurrency built into sending and receiving pings, and this ability will scale with CPU cores (at the cost of some more memory usage). This would mostly be a try-it-and-see scenario. I'd be very curious about any stats you can report!

  2. I don't currently support dynamic reload, but I have always had it in the back of my mind to implement this eventually. Assuming the major rewrite I just did pays off, this is definitely a feature I'll look into in my next batch of updates.

Best regards

@Joshua_Rich


(Steffen Siering) #3

How many Pingbeat instances do you want to deploy? Do you plan on adding some kind of alerting?
You might consider building in some redundancy, for example having each host monitored by 2 or 3 Pingbeat instances. If one instance fails to send data to ES (for example because the Pingbeat host is down or there is a network issue), the 'redundant' Pingbeat instances can still mark the host as available.
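
To illustrate the redundancy idea, here is a minimal sketch of how reports from several instances could be merged. The report format `(instance, host, is_up)` and the function name are made up for this example; they are not Pingbeat's output format.

```python
def merge_availability(reports):
    """Combine ping results from redundant instances.

    reports: iterable of (instance, host, is_up) tuples (hypothetical format).
    A host counts as available if at least one instance saw it up.
    """
    status = {}
    for _instance, host, is_up in reports:
        status[host] = status.get(host, False) or is_up
    return status


if __name__ == "__main__":
    reports = [
        ("pingbeat-a", "10.0.0.1", False),  # e.g. pingbeat-a has a network issue
        ("pingbeat-b", "10.0.0.1", True),
        ("pingbeat-a", "10.0.0.2", False),
        ("pingbeat-b", "10.0.0.2", False),
    ]
    print(merge_availability(reports))
```

With this rule a host is only reported as down when every instance watching it fails to reach it.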

200,000+ sounds like quite a lot. It is hard to tell how it will behave in your environment without any load tests, though. While Beats queues some events internally, there is a chance of data loss if the output cannot push data fast enough (the pipeline blocking the ping workers, or data being dropped). With all endpoints potentially on the same schedule, events will be generated in bursts, potentially overflowing the queues every now and then while the beat is mostly idle the rest of the time. Queue sizes can be increased, though.

On top of that, the ICMP ping requests (and potential DNS queries) might all be executed at the same time. Assuming each ping request or response packet is ~70 bytes in total (this depends on the optional payload in the request), that no request needs to be resent (if no response is received, another request must be made after a timeout), and that every host is pinged once per second, usage might look like:

2 * 70 * 8 * N * R = N * R * 1120 bit/s

with N being the number of hosts in your (sub-)network and R being the redundancy factor. Assuming N=200000 and R=3, the bandwidth requirement goes up to 672 Mbit/s (assuming no DNS or repeated requests).

Since network devices normally work at the packet level, you will also need to deal with ~1.2M packets/sec.

Assuming you ping each host every 60 seconds, average bandwidth will go down to ~11 Mbit/s, but this requires a very smooth schedule in order to keep bandwidth usage constant (most likely you will see large bursts).
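
The arithmetic above can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: the 70-byte packet size and the "no DNS, no resends" assumptions are the ones from this post, and the function name is made up.

```python
# Rough estimate only; all figures are the assumptions stated in this post.
PACKET_BYTES = 70      # assumed size of one ICMP request or response packet
PACKETS_PER_PING = 2   # one request plus one response


def ping_load(hosts, redundancy, interval_s=1.0):
    """Return (bits per second, packets per second), averaged over interval_s."""
    pings_per_s = hosts * redundancy / interval_s
    packets_per_s = pings_per_s * PACKETS_PER_PING
    bits_per_s = packets_per_s * PACKET_BYTES * 8
    return bits_per_s, packets_per_s


if __name__ == "__main__":
    for interval in (1, 60):
        bps, pps = ping_load(200_000, 3, interval)
        print(f"interval={interval}s: {bps / 1e6:.1f} Mbit/s, {pps:,.0f} packets/s")
```

Note that the 60-second figure is an average; if all pings fire at the same instant, the peak rate is still the 1-second figure.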

I have no idea whether Pingbeat can smooth out the burstiness by limiting the number of concurrent tasks per time unit, or whether it tries to create a uniformly distributed schedule (which can be difficult in itself, due to different response times from remotes).
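
For illustration, one simple way to build a uniformly distributed schedule is to give each target a fixed start offset within the ping interval. This is a sketch of the general idea only, not a description of what Pingbeat does; `spread_offsets` is a hypothetical helper.

```python
def spread_offsets(targets, interval_s):
    """Assign each target an evenly spaced start offset in [0, interval_s).

    Target i is first pinged at its offset and then every interval_s seconds,
    so requests are spread over the interval instead of sent in one burst.
    """
    if not targets:
        return {}
    step = interval_s / len(targets)
    return {target: i * step for i, target in enumerate(targets)}


if __name__ == "__main__":
    targets = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]
    for target, offset in spread_offsets(targets, 60).items():
        print(f"{target}: first ping at t={offset:.1f}s, then every 60s")
```

This only evens out when requests are sent; responses (and the events generated from them) will still arrive unevenly because remotes answer at different speeds.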

With this number of hosts I'd consider multiple Pingbeat hosts, each monitoring a subset of the IPs, with some redundancy applied. If locality is somewhat important, even consider having a ping cluster per location.
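
A minimal sketch of how the IP list could be split across several Pingbeat hosts with redundancy: each target is given to R consecutive instances in round-robin order. The instance names and the `assign_targets` helper are hypothetical; the resulting per-instance lists would still need to be written into each instance's own configuration.

```python
def assign_targets(targets, instances, redundancy):
    """Map each instance name to the list of targets it should ping.

    Every target is assigned to `redundancy` distinct instances, chosen
    round-robin so the load is spread roughly evenly.
    """
    if not 1 <= redundancy <= len(instances):
        raise ValueError("redundancy must be between 1 and the number of instances")
    plan = {name: [] for name in instances}
    for i, target in enumerate(targets):
        for k in range(redundancy):
            plan[instances[(i + k) % len(instances)]].append(target)
    return plan


if __name__ == "__main__":
    targets = [f"10.0.0.{n}" for n in range(1, 7)]
    plan = assign_targets(targets, ["pingbeat-a", "pingbeat-b", "pingbeat-c"], 2)
    for name, assigned in plan.items():
        print(name, assigned)
```

For a per-location setup, the same function could be run once per location with that location's targets and instances.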

Question: Is an ICMP-based ping enough, or would you rather have a ping based on TCP, HTTP, or some other protocol that checks whether your service is really available, instead of just checking whether the machine is up?

Just some thoughts on the topic though.

