Schedule not working correctly anymore

Hello,

After the most recent update of Heartbeat (5.2.0 to 5.4.2), Heartbeat apparently sends more than one event per minute, even though it is configured to send one per minute.
For the last few months we got 1440 events per day (sometimes 1439, sometimes 1441, but that's fine), which exactly matches the number of minutes in a day.
After upgrading to 5.4.2, we received 1521 events on 2017-06-26, 1604 on 2017-06-27, 1620 on 2017-06-28, and so on.

Here is my heartbeat configuration:

#======== Monitors ========#
heartbeat.monitors:
  - type: 'http'
    name: 'first'
    schedule: '@every 60s'
    urls:
      - http://www.someurl.com
    timeout: 20
    status: 200

#======== General ========#
name: hb
tags:
  - tag1
  - ...

#======== Processors ========#
processors: []

#======== Outputs ========#
output.logstash:
  enabled: true
  hosts:
    - localhost:1234
  index: index

I reproduced the issue in a VirtualBox VM; here are my results:

Date                                            (#)
June 29th 2017, 09:54:56.521                   (0)
June 29th 2017, 09:53:56.522                   (1)
June 29th 2017, 09:52:56.522                   (2)
June 29th 2017, 09:51:56.529                   (3)
June 29th 2017, 09:49:56.521                   (4)
June 29th 2017, 09:48:56.522                   (5)
June 29th 2017, 09:48:56.522                   (6)
June 29th 2017, 09:47:56.521                   (7)
...

Looking at a longer time interval, one of these double events (see entries (5) and (6) above) appears to occur roughly every 5 to 10 minutes.

Duplicate events might be due to Heartbeat having to resend an event after an error in the communication with Logstash. Heartbeat retries publishing an event up to 3 times before dropping it.
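The retry count corresponds to the max_retries setting of the Logstash output; a minimal sketch (the value shown is the default, if I read the docs correctly):

output.logstash:
  hosts:
    - localhost:1234
  # events are retried up to 3 times and dropped afterwards;
  # a negative value would retry until all events are published
  max_retries: 3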

Which Logstash version are you using?

Please check the Heartbeat and Logstash logs for errors.

Hello @steffens, we are using Logstash 5.4.2 and Heartbeat 5.4.2.
Apparently, Heartbeat logs an error from time to time, like:

2017-06-29T16:10:25+02:00 ERR Failed to publish events caused by: read tcp from->to:6033: read: connection reset by peer
2017-06-29T16:10:25+02:00 INFO Error publishing events (retrying): read tcp from->to:6033: read: connection reset by peer

It looks like Heartbeat then sends the event twice - and I have no idea why.
Another observation: I tried it in an isolated environment using the cron expression '* * * * *' instead of '@every 60s', and that seemed to work. Only the schedule line of the monitor changed, roughly like this:
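heartbeat.monitors:
  - type: 'http'
    name: 'first'
    schedule: '* * * * *'   # fires at the start of every minute, instead of every 60s after startup
    urls:
      - http://www.someurl.com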
I will report back on this tomorrow though.

Looks like Logstash (or the host, or an intermediate device) closed the connection while Heartbeat was waiting for the ACK.

Without the ACK, Heartbeat cannot tell whether the event has been processed by Logstash or not. That is, Heartbeat has to assume processing failed and sends the event again. Duplicates can happen if Logstash did manage to process an event, but the ACK was not returned.
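The publish loop behaves roughly like this (a minimal sketch in Go, not Heartbeat's actual implementation; the send and waitACK callbacks stand in for the Logstash connection):

package sketch

import "errors"

// At-least-once publishing: resend until an ACK arrives, so an event that
// Logstash processed but whose ACK was lost gets delivered twice.
func publish(event []byte, send func([]byte) error, waitACK func() error) error {
	for attempt := 0; attempt < 3; attempt++ { // retry up to 3 times
		if err := send(event); err != nil {
			continue // connection error while sending: try again
		}
		if err := waitACK(); err != nil {
			// Logstash may already have processed the event, but the
			// connection was reset before the ACK arrived; the resend
			// is what produces the duplicate.
			continue
		}
		return nil // ACK received: event confirmed
	}
	return errors.New("dropping event after 3 failed attempts")
}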

You cannot really influence this from within Heartbeat, as the problem lies either within Logstash or one of its outputs. Still, Logstash should return a keepalive every now and then if it is actively processing events published by Heartbeat.

Maybe the Logstash logs (with debug logging enabled, if necessary) can be more enlightening.

The beats input in Logstash has a client_inactivity_timeout that closes idle incoming connections. Increasing the timeout sometimes helps as well.
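For example, in the Logstash pipeline configuration (the port matches your output config above; the timeout value is just an illustration, the default is 60 seconds):

input {
  beats {
    port => 1234
    # default is 60s; raise it well above the Heartbeat schedule interval
    client_inactivity_timeout => 300
  }
}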

Hello, thanks for the hint with client_inactivity_timeout. I increased the value and now it appears to work.
It looks like the issue was that client_inactivity_timeout was set to 60 seconds while Heartbeat sent an event every 60 seconds, so Logstash regularly closed the connection just as the next event was in flight, which resulted in the behavior described above.
