Heartbeat fake timeout

Hi,

We use hearbeat to monitor a hundred applications. In general it works well, but sometimes some applications are down (only for a single ping) according to the tool.
However, this is not the case according to our tests. The error is always the following one: dial tcp xx.xx.xxx.xx: 443: i / o timeout (Client.Timeout exceeded while awaiting headers)

The only similar topic we found was this one: Heartbeat timeout.

Regarding our infrastructures, hearbeat is located on a pod (under docker.elastic.co/beats/heartbeat:7.14.1), in a k3s cluster.
The logs, in debug mode, don't give us any additional information.
We thought about a network problem, and verify this by dumping the network in the heartbeat pod.
For example, we had a timeout on the document API, at endpoint /document/system/health at 08:40.
Dump:

469 2021-09-26 08:39:00,075550 HTTP 281 GET /oauth-sap/system/info/ HTTP/1.1 
471 2021-09-26 08:39:00,077059 HTTP 296 GET /comptabilite-auxilliaire/system/info/ HTTP/1.1 
473 2021-09-26 08:39:00,077448 HTTP 281 GET /stats-sia/system/info/ HTTP/1.1
478 2021-09-26 08:39:00,078396 HTTP 287 GET /activedirectory/system/info/ HTTP/1.1 
479 2021-09-26 08:39:00,078451 HTTP 278 GET /pulpes/system/info/ HTTP/1.1 
481 2021-09-26 08:39:00,078612 HTTP 277 GET /proxy/system/info/ HTTP/1.1
482 2021-09-26 08:39:00,078618 HTTP 279 GET /graines/system/info/ HTTP/1.1 
484 2021-09-26 08:39:00,078856 HTTP 297 GET /indicateurconditionnement/system/info/ HTTP/1.1 
485 2021-09-26 08:39:00,078908 HTTP 283 GET /login-extra/system/info/ HTTP/1.1 
487 2021-09-26 08:39:00,079479 HTTP 268 GET /mapproxy/ HTTP/1.1 
492 2021-09-26 08:39:00,079648 HTTP 282 GET /login-adfs/system/info/ HTTP/1.1 
494 2021-09-26 08:39:00,079781 HTTP 294 GET /transporteur-reporting/system/info/ HTTP/1.1 
496 2021-09-26 08:39:00,079915 HTTP 278 GET /divers/system/info/ HTTP/1.1 
497 2021-09-26 08:39:00,079936 HTTP 283 GET /facturation/system/info/ HTTP/1.1 
499 2021-09-26 08:39:00,080053 HTTP 293 GET /livraisons-receptions/system/info/ HTTP/1.1 
500 2021-09-26 08:39:00,080114 HTTP 286 GET /sync-carto-sia/system/info/ HTTP/1.1 
501 2021-09-26 08:39:00,080463 HTTP 278 GET /paille/system/info/ HTTP/1.1 
509 2021-09-26 08:39:00,080906 HTTP 286 GET /capital-social/system/info/ HTTP/1.1 
--510 2021-09-26 08:39:00,080906 HTTP 281 GET /document/system/health HTTP/1.1 
513 2021-09-26 08:39:00,080985 HTTP 283 GET /partenaires/system/info/ HTTP/1.1 
515 2021-09-26 08:39:00,081009 HTTP 283 GET /analyselabo/system/info/ HTTP/1.1 
516 2021-09-26 08:39:00,081165 HTTP 289 GET /droits-production/system/info/ HTTP/1.1 
517 2021-09-26 08:39:00,081180 HTTP 291 GET /planning-betteraves/system/info/ HTTP/1.1 
518 2021-09-26 08:39:00,081234 HTTP 280 GET /enquetes/system/info/ HTTP/1.1 
519 2021-09-26 08:39:00,081304 HTTP 291 GET /negoces-prestations/system/info/ HTTP/1.1 
520 2021-09-26 08:39:00,081349 HTTP 285 GET /notifications/system/info/ HTTP/1.1 
521 2021-09-26 08:39:00,081377 HTTP 282 GET /erc-config/system/info/ HTTP/1.1 
528 2021-09-26 08:39:00,081992 HTTP 292 GET /autorisationcuws-sap/system/info/ HTTP/1.1 
529 2021-09-26 08:39:00,081999 HTTP 281 GET /videcache/system/info/ HTTP/1.1 
530 2021-09-26 08:39:00,082002 HTTP 290 GET /administration-ihm/system/info/ HTTP/1.1 
531 2021-09-26 08:39:00,082047 HTTP 299 GET /parcelles-api/actuator/health/readiness/ HTTP/1.1 
533 2021-09-26 08:39:00,082164 HTTP 284 GET /uidparameter/system/info/ HTTP/1.1 
535 2021-09-26 08:39:00,082200 HTTP 284 GET /fin-campagne/system/info/ HTTP/1.1 
536 2021-09-26 08:39:00,082289 HTTP 281 GET /agronomie/system/info/ HTTP/1.1 
537 2021-09-26 08:39:00,082382 HTTP 273 GET /geoserver/web/ HTTP/1.1 
539 2021-09-26 08:39:00,090326 HTTP 284 GET /transporteur/system/info/ HTTP/1.1 
684 2021-09-26 08:40:00,089762 HTTP 281 GET /agronomie/system/info/ HTTP/1.1 
689 2021-09-26 08:40:00,091605 HTTP 273 GET /geoserver/web/ HTTP/1.1 
690 2021-09-26 08:40:00,091612 HTTP 283 GET /facturation/system/info/ HTTP/1.1 
691 2021-09-26 08:40:00,091617 HTTP 294 GET /transporteur-reporting/system/info/ HTTP/1.1 
692 2021-09-26 08:40:00,091622 HTTP 283 GET /partenaires/system/info/ HTTP/1.1 
694 2021-09-26 08:40:00,092540 HTTP 281 GET /stats-sia/system/info/ HTTP/1.1 
699 2021-09-26 08:40:00,092646 HTTP 285 GET /notifications/system/info/ HTTP/1.1 
701 2021-09-26 08:40:00,092737 HTTP 292 GET /autorisationcuws-sap/system/info/ HTTP/1.1 
702 2021-09-26 08:40:00,092818 HTTP 284 GET /uidparameter/system/info/ HTTP/1.1 
703 2021-09-26 08:40:00,092822 HTTP 293 GET /livraisons-receptions/system/info/ HTTP/1.1 
704 2021-09-26 08:40:00,092841 HTTP 297 GET /indicateurconditionnement/system/info/ HTTP/1.1 
711 2021-09-26 08:40:00,093343 HTTP 282 GET /login-adfs/system/info/ HTTP/1.1 
712 2021-09-26 08:40:00,093357 HTTP 286 GET /sync-carto-sia/system/info/ HTTP/1.1 
713 2021-09-26 08:40:00,093460 HTTP 281 GET /oauth-sap/system/info/ HTTP/1.1 
716 2021-09-26 08:40:00,093497 HTTP 291 GET /negoces-prestations/system/info/ HTTP/1.1 
718 2021-09-26 08:40:00,093546 HTTP 291 GET /planning-betteraves/system/info/ HTTP/1.1 
719 2021-09-26 08:40:00,093555 HTTP 282 GET /erc-config/system/info/ HTTP/1.1 
720 2021-09-26 08:40:00,093564 HTTP 299 GET /parcelles-api/actuator/health/readiness/ HTTP/1.1 
722 2021-09-26 08:40:00,093693 HTTP 286 GET /capital-social/system/info/ HTTP/1.1 
723 2021-09-26 08:40:00,093738 HTTP 277 GET /proxy/system/info/ HTTP/1.1 
724 2021-09-26 08:40:00,093783 HTTP 289 GET /droits-production/system/info/ HTTP/1.1 
732 2021-09-26 08:40:00,094621 HTTP 283 GET /analyselabo/system/info/ HTTP/1.1 
733 2021-09-26 08:40:00,094665 HTTP 296 GET /comptabilite-auxilliaire/system/info/ HTTP/1.1 
734 2021-09-26 08:40:00,094804 HTTP 290 GET /administration-ihm/system/info/ HTTP/1.1 
735 2021-09-26 08:40:00,094830 HTTP 283 GET /login-extra/system/info/ HTTP/1.1 
736 2021-09-26 08:40:00,094842 HTTP 279 GET /graines/system/info/ HTTP/1.1 
737 2021-09-26 08:40:00,094859 HTTP 284 GET /fin-campagne/system/info/ HTTP/1.1 
738 2021-09-26 08:40:00,095018 HTTP 278 GET /paille/system/info/ HTTP/1.1 
742 2021-09-26 08:40:00,095262 HTTP 278 GET /pulpes/system/info/ HTTP/1.1 
743 2021-09-26 08:40:00,095270 HTTP 280 GET /enquetes/system/info/ HTTP/1.1 
744 2021-09-26 08:40:00,095311 HTTP 287 GET /activedirectory/system/info/ HTTP/1.1 
749 2021-09-26 08:40:00,096401 HTTP 278 GET /divers/system/info/ HTTP/1.1 
750 2021-09-26 08:40:00,096411 HTTP 268 GET /mapproxy/ HTTP/1.1 
751 2021-09-26 08:40:00,096419 HTTP 281 GET /videcache/system/info/ HTTP/1.1 
752 2021-09-26 08:40:00,096424 HTTP 284 GET /transporteur/system/info/ HTTP/1.1 
899 2021-09-26 08:41:00,091295 HTTP 280 GET /enquetes/system/info/ HTTP/1.1 
901 2021-09-26 08:41:00,091833 HTTP 268 GET /mapproxy/ HTTP/1.1 
904 2021-09-26 08:41:00,092356 HTTP 283 GET /partenaires/system/info/ HTTP/1.1 
905 2021-09-26 08:41:00,092459 HTTP 290 GET /administration-ihm/system/info/ HTTP/1.1 
907 2021-09-26 08:41:00,092900 HTTP 284 GET /uidparameter/system/info/ HTTP/1.1 
911 2021-09-26 08:41:00,093250 HTTP 286 GET /capital-social/system/info/ HTTP/1.1 
913 2021-09-26 08:41:00,093369 HTTP 284 GET /fin-campagne/system/info/ HTTP/1.1 
915 2021-09-26 08:41:00,093398 HTTP 292 GET /autorisationcuws-sap/system/info/ HTTP/1.1 
916 2021-09-26 08:41:00,093420 HTTP 284 GET /transporteur/system/info/ HTTP/1.1 
917 2021-09-26 08:41:00,093460 HTTP 281 GET /oauth-sap/system/info/ HTTP/1.1 
920 2021-09-26 08:41:00,093827 HTTP 281 GET /stats-sia/system/info/ HTTP/1.1 
921 2021-09-26 08:41:00,093975 HTTP 283 GET /analyselabo/system/info/ HTTP/1.1 
--924 2021-09-26 08:41:00,094259 HTTP 281 GET /document/system/health HTTP/1.1 
937 2021-09-26 08:41:00,095226 HTTP 299 GET /parcelles-api/actuator/health/readiness/ HTTP/1.1 
939 2021-09-26 08:41:00,095331 HTTP 291 GET /negoces-prestations/system/info/ HTTP/1.1 
940 2021-09-26 08:41:00,095353 HTTP 278 GET /divers/system/info/ HTTP/1.1 
943 2021-09-26 08:41:00,095538 HTTP 278 GET /paille/system/info/ HTTP/1.1 
944 2021-09-26 08:41:00,095595 HTTP 294 GET /transporteur-reporting/system/info/ HTTP/1.1 
945 2021-09-26 08:41:00,095618 HTTP 281 GET /videcache/system/info/ HTTP/1.1 
946 2021-09-26 08:41:00,095628 HTTP 283 GET /facturation/system/info/ HTTP/1.1 
947 2021-09-26 08:41:00,095709 HTTP 293 GET /livraisons-receptions/system/info/ HTTP/1.1 
951 2021-09-26 08:41:00,095756 HTTP 283 GET /login-extra/system/info/ HTTP/1.1 
952 2021-09-26 08:41:00,095829 HTTP 281 GET /agronomie/system/info/ HTTP/1.1 
956 2021-09-26 08:41:00,095954 HTTP 277 GET /proxy/system/info/ HTTP/1.1 
958 2021-09-26 08:41:00,096673 HTTP 287 GET /activedirectory/system/info/ HTTP/1.1 
959 2021-09-26 08:41:00,096738 HTTP 279 GET /graines/system/info/ HTTP/1.1 
960 2021-09-26 08:41:00,096808 HTTP 278 GET /pulpes/system/info/ HTTP/1.1 
961 2021-09-26 08:41:00,096886 HTTP 282 GET /login-adfs/system/info/ HTTP/1.1 
962 2021-09-26 08:41:00,096976 HTTP 282 GET /erc-config/system/info/ HTTP/1.1 
963 2021-09-26 08:41:00,097097 HTTP 273 GET /geoserver/web/ HTTP/1.1 
964 2021-09-26 08:41:00,097271 HTTP 296 GET /comptabilite-auxilliaire/system/info/ HTTP/1.1 
965 2021-09-26 08:41:00,097391 HTTP 297 GET /indicateurconditionnement/system/info/ HTTP/1.1 
966 2021-09-26 08:41:00,097824 HTTP 285 GET /notifications/system/info/ HTTP/1.1 
967 2021-09-26 08:41:00,097970 HTTP 289 GET /droits-production/system/info/ HTTP/1.1 
968 2021-09-26 08:41:00,098565 HTTP 291 GET /planning-betteraves/system/info/ HTTP/1.1 
969 2021-09-26 08:41:00,098897 HTTP 286 GET /sync-carto-sia/system/info/ HTTP/1.1

Unfortunately the dump shows us that the request does not come out of the pod at 08:40. It's a problem related to the heartbeat pod.
Do you have any ideas to solve this problem ?

Thanks,
Guillaume

Sorry to hear about these issues, definitely something weird is going on.

Can you tell me more about the log you posted? How was it created? Were you monitoring the network interface directly? If so, with which tool? Are you certain a failed TCP check would show in the monitoring you're doing?

Also, that log seems to be a list of requests, but I don't see responses (and response codes). Are you tracking those as well?

The error about the inability to connect is straight from the go standard library's TCP implementation. It's possible there's an error in how we're using it, but it is unlikely.

Thanks for your help.
This is a simple dump done from the tcpdump command (in the hearbeat pod), which captures network traffic from the pod to our appgateway.
Responses aren't taken into account, we don't have any information about them, only on sending packets.
Command used: nohup tcpdump "dst xx.xx.xxx.xx and tcp port 80" -w /-%Y-%m-%d_%H.%M.%S.pcap -G 1800 -C 20 >/dev/null 2>&1 &

Thanks for the in-depth debugging!

Is that log filtered for just HTTP? The reason I ask is, would a failed TCP 3-way handshake show up? In other words, if it's a timeout, I'm wondering if it's failing on the handshake portion before we even get to HTTP. If so, I'm wondering if that would show up in that log.

The message "Client.Timeout exceeded while awaiting headers" comes from the golang stdlib, and is a little misleading. It will basically happen anytime the request fails due to a timeout, where the phrasing sort of implies we at least have a valid TCP connection.

It's possible that there is a bug in heartbeat here, but given that this is such core functionality, and we've received no other reports, it seems less likely.

Indeed, tcp requests are also traced.
We tried to find the corresponding tcp request related to the http request. It seems that at 08:40, a tcp request (without http request) sends "SYN", with several attempts, but no "ACK".


A normal connection looks like this:

I think you're right, we'll try to figure out where the problem is.
Thank you

No worries, sorry you hit a weird networking bug, those are never fun! But it does sound, indeed, like a failed 3-way handshake.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.