Heartbeat TCP type monitor errors with "io: read tcp ... i/o timeout"

I am using Heartbeat version 8.6 and am connecting to a self-managed Elasticsearch version 8.11.4. Currently I have over 60 HTTP type monitors running and can successfully view their status in Kibana. I added one additional TCP type monitor to the heartbeat.yml file, but encountered an error that I need help with.

io: read tcp <source-ip-address-where-heartbeat-is-running>:<random-port> -> <target-hostname1-ip-address>:<target-hostname1-port>: i/o timeout

Snippet from the heartbeat.yml configuration file:

- type: tcp
  enabled: true
  id: Monitor1
  name: MonitorOne
  schedule: '@every 120s'
  timeout: 60
  hosts: ["<target-hostname1>"]
  ports: [<target-hostname1-port>]
  ssl.enabled: false
  check.send: "<Message1>"
  check.receive: "True"

- type: http
  enabled: true
  id: Monitor2
  name: MonitorTwo
  schedule: '@every 120s'
  timeout: 60
  urls: ["https://<target-hostname2>/"]

output.elasticsearch:
  hosts: ["<elasticsearh>:<port>"]
  protocol: "https"
  allow_older_versions: true
  ssl.verification_mode: "none"
  username: "<user-name>"
  password: "<password>"

I have an application on server "target-hostname1" that listens on "target-hostname1-port", accepts the string "Message1", and responds with the string "True".
From the same source server where Heartbeat is running, I was able to run successful tests to rule out any firewall issues. I confirmed that the MonitorOne application can receive and answer with strings over TCP, and its response arrives well within the configured timeout.
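For illustration only, the responder behaves roughly like this minimal Go sketch (this is not the actual application code, and the port is a placeholder):

package main

import (
    "log"
    "net"
    "strings"
)

func main() {
    // Placeholder port standing in for <target-hostname1-port>.
    ln, err := net.Listen("tcp", ":9000")
    if err != nil {
        log.Fatal(err)
    }
    for {
        conn, err := ln.Accept()
        if err != nil {
            log.Fatal(err)
        }
        go func(c net.Conn) {
            defer c.Close()
            buf := make([]byte, 1024)
            // A single read is enough for this sketch.
            n, err := c.Read(buf)
            if err != nil {
                return
            }
            if strings.TrimSpace(string(buf[:n])) == "Message1" {
                c.Write([]byte("True")) // reply with the expected string
            }
        }(conn)
    }
}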

echo "Message1" | curl -ivk  telnet://<target-hostname1>:<target-hostname1-port>
and
echo "Message1" | nc -v <target-hostname1> <target-hostname1-port>
Both of the above commands, run from the terminal, respond with the string "True".
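My understanding (an assumption on my part, not Heartbeat's actual source) is that a send/expect TCP probe behaves roughly like the Go sketch below: dial, write the check.send payload, then read under a deadline until the check.receive payload appears. An expiring read deadline is exactly what surfaces as "read tcp ...: i/o timeout".

package main

import (
    "bytes"
    "fmt"
    "net"
    "time"
)

func main() {
    // Placeholder address standing in for the monitored endpoint.
    conn, err := net.DialTimeout("tcp", "<target-hostname1>:<target-hostname1-port>", 60*time.Second)
    if err != nil {
        fmt.Println("connect failed:", err)
        return
    }
    defer conn.Close()

    conn.SetDeadline(time.Now().Add(60 * time.Second)) // the monitor's timeout
    if _, err := conn.Write([]byte("Message1")); err != nil { // check.send payload
        fmt.Println("write failed:", err)
        return
    }

    want := []byte("True") // check.receive payload
    var got []byte
    buf := make([]byte, 256)
    for !bytes.Contains(got, want) {
        n, err := conn.Read(buf)
        if err != nil {
            // If the expected bytes never arrive before the deadline,
            // this read fails with "read tcp ...: i/o timeout".
            fmt.Println("read failed:", err)
            return
        }
        got = append(got, buf[:n]...)
    }
    fmt.Println("up: expected response received")
}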

From the server logs, I confirmed that a request came in with "Message1" and that the response was also sent out by the application.
I tried increasing the timeout value on Monitor1 to give "target-hostname1" more time to respond, but the behavior did not change. I continuously received the error message from Heartbeat, each time with a different source port:

<source-ip-address-where-heartbeat-is-running>:<random-port-1> -> <target-hostname1-ip-address>:<target-hostname1-port>: i/o timeout
<source-ip-address-where-heartbeat-is-running>:<random-port-2> -> <target-hostname1-ip-address>:<target-hostname1-port>: i/o timeout
<source-ip-address-where-heartbeat-is-running>:<random-port-3> -> <target-hostname1-ip-address>:<target-hostname1-port>: i/o timeout

When I remove check.send and check.receive from MonitorOne, Heartbeat considers MonitorOne to be "up", which confirms that the TCP connection itself is successful. But when I keep check.send and remove check.receive, I continue to get the i/o timeout error in the form of

<source-ip-address-where-heartbeat-is-running>:<random-port> -> <target-hostname1-ip-address>:<target-hostname1-port>: i/o timeout

Can anybody provide some thoughts on why Heartbeat cannot read the response from the application that runs on "target-hostname1-ip-address":"target-hostname1-port"?
I already checked the topic "Postgresql tcp check is not working" and am seeking further help.


Hi @FaizanGit,

Thanks for providing such a detailed summary! From the description, it would seem the sent body is not being closed out completely before Heartbeat times out. curl and netcat are both more flexible in that regard, so they might be able to cope with it.
I'd suggest recording and comparing tcpdump captures of the Heartbeat attempt and of the curl/netcat tests; that might show the difference.
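For example, something along these lines on the Heartbeat host while the monitor runs (interface and filter adapted to your environment):

tcpdump -i any -nn -A host <target-hostname1> and port <target-hostname1-port>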

I modified my application on server "target-hostname1" so that it receives and responds with the same string "Message1". Accordingly, I updated my heartbeat.yml file as follows for MonitorOne:

- type: tcp
  enabled: true
  id: Monitor1
  name: MonitorOne
  schedule: '@every 120s'
  timeout: 60
  hosts: ["<target-hostname1>"]
  ports: [<target-hostname1-port>]
  ssl.enabled: false
  check.send: "<Message1>"
  check.receive: "<Message1>"

Additionally, I captured the incoming traffic using tcpdump on the same server where Heartbeat is running. While Heartbeat is running, below is the traffic received from "target-hostname1":"target-hostname1-port":

IP "target-hostname1"."target-hostname1-port" > "source-ip-address-where-heartbeat-is-running"."random-port1": Flags [S.], seq 2933006261, ack 1307184683, win 65535, options [mss 1460,nop,wscale 8,nop,nop,sackOK], length 0
E..4.m@.v._.
W       H....'t......M..+....................
IP "target-hostname1"."target-hostname1-port" > "source-ip-address-where-heartbeat-is-running"."random-port1": Flags [.], ack 10, win 1026, length 0
E..(.n@.v._.
W       H....'t......M..4P....`........
IP "target-hostname1"."target-hostname1-port" > "source-ip-address-where-heartbeat-is-running"."random-port1": Flags [.], ack 11, win 1026, length 0
E..(.o@.v._.
W       H....'t......M..5P...._........
IP "target-hostname1"."target-hostname1-port" > "source-ip-address-where-heartbeat-is-running"."random-port1": Flags [P.], seq 1:12, ack 11, win 1026, length 11
E..3.p@.v._.
W       H....'t......M..5P......."Message1"

IP "target-hostname1"."target-hostname1-port" > "source-ip-address-where-heartbeat-is-running"."random-port1": Flags [F.], seq 12, ack 11, win 1026, length 0
E..(.q@.v._.
W       H....'t......M..5P....S........

The same kind of tcpdump output is captured on the server where Heartbeat is running when I interact with my application using netcat and curl. Can somebody help explain:

  1. In what format does Heartbeat read the response for check.receive?
  2. My application responds within a tenth of the timeout value configured in Heartbeat. Why does it work with netcat and curl, but not with Heartbeat?
  3. When I removed both check.send and check.receive from the heartbeat.yml file, Heartbeat considered the monitor up. When only check.receive is removed, Heartbeat should accept any returned value from my application and therefore still show MonitorOne as 'up', but it continues to show it as down with the same error. This does not match the Heartbeat TCP monitor documentation. Why is Heartbeat determining that my app is down?