Heartbeat 6.7 regression bug: mode:all always logs a false UP

Hi,

Monitor config:

- type: http
  name: test
  urls:
    - http://elastic.co:8888/v1
  ipv6: false
  mode: all
  schedule: '@every 5s'
  timeout: 2s

Both 6.6.2 and 6.7 log all the right DOWN events: this monitor is NOT supposed to work, because the request times out. There is more than one DOWN event because elastic.co resolves to more than one IP and the config uses the mode:all setting.
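
To illustrate why several DOWN events are expected here, below is a minimal sketch of the per-run event count. It assumes that mode:all checks every resolved IP and mode:any checks only one. The function name `expected_event_count` is hypothetical, and the two `192.0.2.x` addresses are placeholders (only 151.101.66.217 appears in the events in this report). DNS failures are not modeled.

```python
def expected_event_count(resolved_ips, mode):
    """Return how many check results one run should emit for a single URL.

    Assumption: mode 'all' checks every resolved IP, mode 'any' checks one.
    """
    if not resolved_ips:
        raise ValueError("no resolved IPs; DNS failure is not modeled here")
    if mode == "all":
        return len(resolved_ips)
    if mode == "any":
        return 1
    raise ValueError(f"unknown mode: {mode!r}")


# One real IP from the DOWN event in this report plus two placeholder addresses.
ips = ["151.101.66.217", "192.0.2.10", "192.0.2.11"]

count_all = expected_event_count(ips, "all")
count_any = expected_event_count(ips, "any")
```

With three resolved IPs, one run should produce three events under mode:all, so any additional event without an IP is the unexpected one.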

But in 6.7 heartbeat ALSO logs an UP event, which makes no sense. This is the bug.
The relevant sections of the unexpected UP event:

"monitor": {
      "scheme": "http",
      "id": "test@http://elastic.co:8888/v1",
      "type": "http",
      "name": "test",
      "duration": {
        "us": 1122
      },
      "status": "up"
    },
    "http": {
      "url": "http://elastic.co:8888/v1"
    },
    "tcp": {
      "port": 8888
    },
    "event": {
      "dataset": "uptime"
    }

Relevant section of the expected DOWN event:

"monitor": {
      "name": "test",
      "type": "http",
      "host": "elastic.co",
      "ip": "151.101.66.217",
      "duration": {
        "us": 2000332
      },
      "status": "down",
      "scheme": "http",
      "id": "test@http://elastic.co:8888/v1"
    },
    "http": {
      "url": "http://elastic.co:8888/v1"
    },
    "event": {
      "dataset": "uptime"
    },
    "tcp": {
      "port": 8888
    },
    "resolve": {
      "host": "elastic.co",
      "ip": "151.101.66.217",
      "rtt": {
        "us": 1059
      }
    },
    "error": {
      "type": "io",
      "message": "Get http://elastic.co:8888/v1: dial tcp 151.101.66.217:8888: i/o timeout (Client.Timeout exceeded while awaiting headers)"
    }

With a working target such as "https://www.elastic.co", the same behavior is observed in 6.7.0: an extraneous UP event is emitted that is missing all the expected fields (HTTP code, timers, IP, etc.).
Except in this case the extraneous UP event isn't exactly lying. It's a "broken clock is right twice a day" kind of thing.

So in short, with mode:all on 6.7 you get an extraneous, false, incomplete UP event with no IP, no HTTP response code, missing timers, etc. (Those would be there if it were a real UP event, a true success, which is impossible here.)
I can't repro this with mode:any so far, or in heartbeat 6.6.2. (I haven't tested other versions yet.)
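
For anyone who needs to keep alarms or dashboards sane in the meantime, here is a rough sketch of how the extraneous event could be told apart from the real ones. It relies only on what the two events above show: the unexpected UP event has neither `monitor.ip` nor a `resolve` block, while the real per-IP event has both. The helper names `get_path` and `is_suspect_up` are hypothetical, and the sample events are trimmed copies of the ones in this report. This is a heuristic, not a fix.

```python
def get_path(event, dotted):
    """Look up a dotted key such as 'monitor.ip'; return None if any part is missing."""
    node = event
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def is_suspect_up(event):
    """True for an UP event that carries no per-IP details (heuristic)."""
    if get_path(event, "monitor.status") != "up":
        return False
    return (get_path(event, "monitor.ip") is None
            and get_path(event, "resolve.ip") is None)


# Trimmed copy of the unexpected UP event from this report.
spurious_up = {
    "monitor": {"id": "test@http://elastic.co:8888/v1", "status": "up",
                "duration": {"us": 1122}},
    "http": {"url": "http://elastic.co:8888/v1"},
}

# Trimmed copy of the expected DOWN event from this report.
real_down = {
    "monitor": {"id": "test@http://elastic.co:8888/v1", "status": "down",
                "ip": "151.101.66.217", "duration": {"us": 2000332}},
    "resolve": {"host": "elastic.co", "ip": "151.101.66.217"},
}

events = [spurious_up, real_down]
kept = [e for e in events if not is_suspect_up(e)]
```

Filtering this way drops only the UP event that lacks IP details and leaves the per-IP events untouched.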

I also didn't test other potential triggers for the bug, so maybe mode:all isn't the cause in the end. I'm tied up with another bug that I'm trying to isolate and reproduce reliably, where monitors silently stop sending events completely and a heartbeat restart fixes it. That one happens in 6.6.2 and still happens in 6.7, so it's definitely a bug, but I think it's unrelated.

I think the extraneous false UP event described above is a critical regression because of the problems it could cause for people who upgrade, especially if they have alarms or dashboards that depend on these events.

Let me know if someone else can repro this. If so, I assume it can go to GitHub.

Martin
Gold license customer
Running self-managed 5.5.2 cluster on AWS ECS.

Hi @martinr_ubi

Thanks for the detailed report, appreciate it. I haven't managed to reproduce this myself yet, but based on your report this definitely seems like a bug. Could you open an issue on GitHub with all the same details you posted here: https://github.com/elastic/beats

Thanks, I detailed it here:

Martin

Thank you.
