Hi there,
We've got several heartbeat instances running for uptime monitoring. Usually it works fine, but sometimes the uptime view says applications are down (while our second tool and our manual tests say they are not), and the error is mostly the following:
Client.Timeout exceeded while awaiting header
We've got about 325 applications being monitored. I've tried splitting the monitors across several heartbeat instances, which solved most of the problems, but we still get this error sometimes. Any tips on how we can fine-tune this part?
325 applications is a pretty small number, even when tested quite frequently, so I doubt this is due to load on heartbeat. That said, it'd be good to know if the box running heartbeat is hitting resource limits. Are you seeing spikes in CPU or I/O during these outages?
Barring that, how frequently is heartbeat saying these apps are down? The reason I ask is that if heartbeat runs a check, say, every 10s, and the app is available 99.99% of the time, you'd expect one failed check in every 10,000, which works out to an outage roughly every 27 to 28 hours. Most people only alert if they see a couple of consecutive outages for this reason. In other words, it could have to do with the frequency of the checks and your app / network's actual availability.
Lastly, it's possible this is a bug in heartbeat, but this isn't something we have other open issues about, so that seems unlikely given its stability.
Hi Andrew,
Thank you for taking the time to help me. The server is not running into any resource limits; frankly, that was the first thing I checked, because it would be the most logical explanation. Unfortunately, the CPU is handling the load just fine.
Most of the time it reports an application as down. It scans the endpoints every 2 minutes, and which application is reported down differs from check to check, but it seems like one instance can handle at most around 30-40 monitors. I've split up the heartbeat monitors even more and now it seems to run stably, but to get there we apparently need around 7 heartbeat instances running. To answer your question: it gives us the down status far more often than we'd like it to.
Hi Andrew, we've split the monitors even more, but we are still getting timeouts once in a while. Any ideas, or any optimization settings we can try?
Apologies for the delay here. Unfortunately this is a tricky one at this stage. If you're up for it, what would really help would be to get a packet dump if it's reproducible.
There are two theories:
Your application is responding, but heartbeat, for whatever reason, can't figure that out
Your application isn't liking heartbeat, and is timing out in talking to it for some reason.
Unfortunately this requires a dump. I'd use Wireshark personally (or tcpdump). Is this something you're willing / able to do? It's a weird one because at this point we're potentially debugging the Go network stack.
I'm fine with getting a dump for you, but since I have not done that before I might need some guidance. Could you point me in the right direction on how to create one?