[Heartbeat]: Add retry logic to http/tcp checks

Hello Beats friends,
Does the current Heartbeat support retries in the http/tcp checks?
This is a problem for us, since it looks like there is no retry logic in the http/tcp checks.
As far as I can find, only the icmp check supports something like a retry, through the following option:
There is a config option named wait, which controls the duration to wait before emitting another ICMP Echo Request. The default is 1 second.

That means if the total timeout is 10 seconds (the default is 16 seconds), it will send up to 10 requests if the target does not respond.
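
To make the arithmetic concrete, here is a rough sketch in Python (not Heartbeat code). It assumes the first echo request goes out at t=0 and another follows every wait seconds until the timeout expires. The function name and parameters are made up for illustration.

```python
def echo_request_offsets(timeout_s, wait_s):
    """Return the offsets (seconds from start) at which an echo request
    would be sent if no reply ever arrives.

    Assumption: the first request is sent at t=0 and one more follows
    every wait_s seconds for as long as the timeout has not expired.
    """
    if wait_s <= 0:
        raise ValueError("wait_s must be positive")
    offsets = []
    i = 0
    while i * wait_s < timeout_s:
        offsets.append(i * wait_s)
        i += 1
    return offsets


print(len(echo_request_offsets(10, 1)))  # 10 requests for a 10 s timeout
print(len(echo_request_offsets(16, 1)))  # 16 requests for the 16 s default
```

Whether the real count is off by one at the boundary depends on how Heartbeat schedules the last request, which I have not verified.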

Could you correct me if there is retry logic in the http/tcp checks?
If there is, could you document it?
If not, I'd like to contribute a retry feature for the http/tcp checks.

Thanks.

Hello,

I am not aware of any retry logic in Heartbeat for http/tcp. My best guess as to why the wait option exists for ICMP is that some servers drop ICMP packets in favor of actual data traffic:

Some servers (and some routers) may specifically block (or down-prioritize) ICMP echo requests or where TTL=0. These routers (or the final destination) might show 100% packet loss or high packet loss and latency.

TCP, on the other hand, can detect lost packets and retransmit them, and HTTP checks benefit from this because they run over TCP.

Can you explain your use case and why you need retries for TCP/HTTP? Do you have a messy network with high packet loss?

Depending on the use case, you could also implement a workaround in the dashboards/alerts. For example, you could run the monitor every second and configure the alerts to execute the action only if all heartbeats in the last 10 seconds failed.
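
As a minimal sketch of that alert condition (plain Python, not an actual alerting rule), assuming each heartbeat is available as a (timestamp, is_up) pair. The function name should_alert and the data shape are hypothetical:

```python
from datetime import datetime, timedelta


def should_alert(results, now, window=timedelta(seconds=10)):
    """Return True only if every heartbeat inside the window failed.

    results: iterable of (timestamp, is_up) tuples (hypothetical shape).
    An empty window does not alert, since there is nothing to judge.
    """
    recent = [is_up for ts, is_up in results if now - window <= ts <= now]
    return bool(recent) and not any(recent)


now = datetime(2020, 1, 1, 0, 0, 10)
all_down = [(now - timedelta(seconds=s), False) for s in range(10)]
one_up = all_down[:-1] + [(now - timedelta(seconds=9), True)]
print(should_alert(all_down, now))  # True
print(should_alert(one_up, now))    # False
```

A single successful check inside the window suppresses the alert, which is what filters out short blips.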

Best regards
Wolfram

@Wolfram_Haussig
Thanks for your clarification.
Yes, the wait in the ICMP check is not the kind of retry logic I had in mind; it happens within a single ICMP check.
Heartbeat only needs the first echo response after an echo request is sent. If no echo response arrives within wait seconds, it sends another echo request.
Capturing the ICMP packets with tcpdump shows this behavior:

23:40:16.979984 IP a > b: ICMP echo request, id 46893, seq 1, length 52
23:40:17.981122 IP a > b: ICMP echo request, id 46893, seq 2, length 52

I'm afraid we cannot use the workaround you suggested: we have multiple Heartbeat instances, and a single instance runs http checks against almost 90K targets.
Right now we produce 90K metrics per minute. If we ran these http checks every second, that would be 90K * 60 = 5.4M metrics per minute, which requires significantly more storage.

Yes, you are right. And storage is not the only concern when pinging 90K targets every second: it would also generate a lot of traffic in your network.

The only way I can think of, besides extending Heartbeat with HTTP retry logic, is to use Logstash:

  • main pipeline
    • input gets a document containing target and retryCounter (either file or Kafka input)
    • use the http filter to ping the target
    • if success: transform the document to conform to the Heartbeat structure and send it to Elasticsearch with status success
    • if failure and retryCounter < 10: trigger main pipeline with retryCounter+1
    • if failure and retryCounter >= 10: transform the document to conform to the Heartbeat structure and send it to Elasticsearch with status failure
  • trigger pipeline
    • every 10 seconds, read all targets to monitor (e.g. Elasticsearch input)
    • trigger main pipeline with retryCounter=0
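
The retryCounter flow above can be sketched in plain Python. This is only an illustration of the control flow, not Logstash configuration. The names http_ping and check_with_retry and the result dict are hypothetical, and the dict is not the real Heartbeat schema:

```python
import urllib.error
import urllib.request

MAX_RETRIES = 10  # matches the retryCounter limit in the list above


def http_ping(url, timeout=5.0):
    """Return True if the URL answers with a 2xx/3xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError):
        # Covers HTTP error statuses, connection failures and timeouts.
        return False


def check_with_retry(target, ping=http_ping, max_retries=MAX_RETRIES):
    """Ping the target, re-triggering on failure until the counter is used up."""
    retry_counter = 0
    while True:
        if ping(target):
            return {"target": target, "status": "success", "retries": retry_counter}
        if retry_counter >= max_retries:
            return {"target": target, "status": "failure", "retries": retry_counter}
        retry_counter += 1  # "trigger main pipeline with retryCounter+1"
```

Note that with a limit of 10 the target is pinged up to 11 times in total (the initial attempt plus 10 retries). Passing the ping function as a parameter makes it easy to swap in a fake for testing without touching the network.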

The downsides are:

  • you need a Logstash server (irrelevant if you already have one)
  • you have to do the field mapping to the Heartbeat schema yourself in the Logstash pipeline
  • Logstash does not support pipelines writing to themselves, so you need either a file for input/output or, at best, a Kafka installation

Disclaimer: We use that logic every 15 minutes to read data from a REST webservice which requires paging of large results. I don't know how this will scale if called every 10 seconds.

Maybe another user has a better idea?

Best regards
Wolfram

Hmm, Heartbeat would be the uniform solution for our monitoring architecture (http/tcp/icmp checks).
We don't have time right now to investigate using Logstash for the http check as you suggested.
Actually, the code changes that add retry logic to the http check are ready on our side, and I'm about to file a PR with the community.
Thanks for the details; I appreciate your suggestions.
