Heartbeat: Ping Timeout on hosts causes state.duration_ms to stay at 0?

We are monitoring a few hosts with ICMP in Heartbeat. Heartbeat accurately reports uptime via the state.duration_ms field. However, shutting down a host for testing produces an error.message of "ping timeout", which is expected, but whenever that error occurs, state.duration_ms stays at 0. The state.id also changes on every event, so all of the state subfields are constantly reset and never track from when the host first went down. We want to use state.duration_ms to measure how long the host has been down, but we can't, because Heartbeat treats each ping timeout as a new state and the duration never accumulates. Is there anything we can do to achieve this?
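For illustration, here is what I expected consecutive failed checks to look like (a simplified, hand-written sketch, not real events): the same state.id carried across events, with state.checks incrementing and state.duration_ms growing.

First down event:

  {
    "state": {
      "id": "default-1884e0759dc-6e52",
      "status": "down",
      "checks": 1,
      "duration_ms": "0",
      "started_at": "2023-05-24T13:51:24.636Z"
    }
  }

Next down event, one schedule interval (10s) later:

  {
    "state": {
      "id": "default-1884e0759dc-6e52",
      "status": "down",
      "checks": 2,
      "duration_ms": "10000",
      "started_at": "2023-05-24T13:51:24.636Z"
    }
  }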

Hi @scantron, sorry to hear you're having issues using Heartbeat.

This is a feature that has undergone numerous iterations throughout v8.x. Could you please let me know what version of the Elastic Stack you are running?

I am running the latest version, 8.7.

Thanks for sharing that - so far I haven't been able to reproduce this issue on my side.

Some additional things that could help diagnose the issue:

  • Are you able to include any config for your monitor?
  • It is odd that you're seeing a different state ID on each ping. Are you using run once mode?

It may also be helpful to take a look at the Heartbeat logs: are any errors showing up there?
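For reference, run-once mode is the top-level heartbeat.run_once setting in heartbeat.yml; a minimal sketch of what enabling it would look like (the monitor values here are placeholders):

  heartbeat.run_once: true
  heartbeat.monitors:
    - type: icmp
      id: sampleid123
      hosts: ["the-problem-host"]
      schedule: '@every 10s'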

Hi J, thanks for your response. Here is a sample config for the host that is having this problem.

Under the monitors.d folder, we have this config enabled:

  type: icmp
  id: sampleid123
  name: samplename123
  enabled: true
  schedule: '@every 10s'
  hosts: [
   "the-problem-host"
  ]
  ipv6: true
  mode: any
  timeout: 5s
  wait: 1s
  tags: ["tag1"]
  fields_under_root: true
  fields:
    field1: value1

Names and fields have been swapped out, but all options are the same.

Here are two sample logs that demonstrate this behavior in JSON format:

Log 1

{
  "_index": "heartbeat_index",
  "_id": "hCwHTogB0CLBbzCvXggs",
  "_version": 1,
  "_score": null,
  "_source": {
    "error": {
      "type": "io",
      "message": "ping timeout"
    },
    "hostname": "the-problem-host",
    "state": {
      "down": 1,
      "duration_ms": "0",
      "up": 0,
      "ends": {
        "status": "up",
        "duration_ms": "0",
        "up": 1,
        "ends": null,
        "down": 0,
        "checks": 1,
        "flap_history": null,
        "id": "default-1884e074654-6e51",
        "started_at": "2023-05-24T13:51:19.636833008Z"
      },
      "status": "down",
      "id": "default-1884e0759dc-6e52",
      "flap_history": [],
      "started_at": "2023-05-24T13:51:24.636739175Z",
      "checks": 1
    },
    "summary": {
      "up": 0,
      "down": 1
    },
    "monitor": {
      "duration": {
        "us": 5002514
      },
      "name": "samplename1",
      "timespan": {
        "gte": "2023-05-24T13:51:24.636Z",
        "lt": "2023-05-24T13:51:34.636Z"
      },
      "check_group": "0cb1a619-fa3a-11ed-a6b6-00163e77ee12",
      "status": "down",
      "ip": "redacted",
      "type": "icmp",
      "id": "sampleid1"
    },
    "field1":"value1",
    "@version": "1",
    "agent": {
      "type": "heartbeat",
      "ephemeral_id": "4773b85c-53b9-43af-b353-1777ec80b0d9",
      "id": "c51f1f5c-afbf-458b-a0aa-feb5ae48312b",
      "name": "heartbeat_instance",
      "version": "8.7.0"
    }
  },
  "fields": {
    "@timestamp": [
      "2023-05-24T13:51:19.634Z"
    ],
    "monitor.timespan.lt": [
      "2023-05-24T13:51:34.636Z"
    ],
    "state.ends.started_at": [
      "2023-05-24T13:51:19.636Z"
    ],
    "state.started_at": [
      "2023-05-24T13:51:24.636Z"
    ],
    "monitor.timespan.gte": [
      "2023-05-24T13:51:24.636Z"
    ]
  }
}

Log 2

{
  "_index": "heartbeat_index",
  "_id": "rSwHTogB0CLBbzCvbQh_",
  "_version": 1,
  "_score": null,
  "_source": {
    "error": {
      "type": "io",
      "message": "ping timeout"
    },
    "hostname": "the-problem-host",
    "state": {
      "checks": 1,
      "duration_ms": "0",
      "up": 0,
      "ends": {
        "status": "up",
        "duration_ms": "0",
        "up": 1,
        "ends": null,
        "down": 0,
        "checks": 1,
        "flap_history": null,
        "id": "default-1884e0755a7-6cbf",
        "started_at": "2023-05-24T13:51:23.559937829Z"
      },
      "down": 1,
      "id": "default-1884e07692f-6cc0",
      "flap_history": [],
      "started_at": "2023-05-24T13:51:28.559665451Z",
      "status": "down"
    },
    "summary": {
      "up": 0,
      "down": 1
    },
    "monitor": {
      "duration": {
        "us": 5001712
      },
      "name": "samplename1",
      "timespan": {
        "gte": "2023-05-24T13:51:28.559Z",
        "lt": "2023-05-24T13:51:38.559Z"
      },
      "check_group": "0f0814db-fa3a-11ed-bc8d-00163eee7ae2",
      "type": "icmp",
      "ip": "redacted",
      "id": "sampleid1",
      "status": "down"
    },
    "field1":"value1",
    "@version": "1",
    "agent": {
      "type": "heartbeat",
      "ephemeral_id": "a68c31e5-eba7-4944-b88d-d5020917b488",
      "id": "422d8f5a-bbe0-4a7b-92d9-542ef9192560",
      "name": "heartbeat_instance",
      "version": "8.7.0"
    }
  },
  "fields": {
    "@timestamp": [
      "2023-05-24T13:51:23.557Z"
    ],
    "monitor.timespan.lt": [
      "2023-05-24T13:51:38.559Z"
    ],
    "state.ends.started_at": [
      "2023-05-24T13:51:23.559Z"
    ],
    "state.started_at": [
      "2023-05-24T13:51:28.559Z"
    ],
    "monitor.timespan.gte": [
      "2023-05-24T13:51:28.559Z"
    ]
  }
}

These logs were produced one right after the other, so the duration that the host has been down should accumulate, but you can see that the state ID changes and all of the state subfields reset with it. Any irrelevant or personal fields have been deleted or redacted.

The heartbeat.yml config file has no special settings. The output is Kafka, and it contains only a couple of processors that copy field values into different field names.
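For reference, that kind of setup in heartbeat.yml would look roughly like the following; the broker address, topic, and field names are placeholders rather than the real values:

  output.kafka:
    hosts: ["kafka-broker:9092"]
    topic: "heartbeat"

  processors:
    # copy_fields duplicates a value into another field name
    - copy_fields:
        fields:
          - from: monitor.status
            to: check.status
        fail_on_error: false
        ignore_missing: true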

bump

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Hi @scantron,

Something that caught my attention about the documents you provided:

  • In the example <Log 1>, a new down state with id: default-1884e0759dc-6e52 is ending a previous up state with id: default-1884e074654-6e51.
  • Whereas in <Log 2>, the new down state with id: default-1884e07692f-6cc0 is ending a previous up state with id: default-1884e0755a7-6cbf.

So there are definitely two distinct up states at the time these checks ran. I then noticed that the two documents carry different agent IDs, as if they were produced by different Heartbeat instances:

  • <Log 1>
    "agent": {
      "type": "heartbeat",
      "ephemeral_id": "4773b85c-53b9-43af-b353-1777ec80b0d9",
      "id": "c51f1f5c-afbf-458b-a0aa-feb5ae48312b",
  • <Log 2>
    "agent": {
      "type": "heartbeat",
      "ephemeral_id": "a68c31e5-eba7-4944-b88d-d5020917b488",
      "id": "422d8f5a-bbe0-4a7b-92d9-542ef9192560",

Can you tell us a bit more about how these are set up? If you're running two or more instances of Heartbeat with the same monitor config, you'll get a different state ID per instance, since state is tracked locally by each instance.
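One quick way to check, assuming the events end up in an Elasticsearch index you can query (the index name and monitor ID below are taken from your sample documents; adjust them as needed), is a terms aggregation on agent.id for the affected monitor. More than one bucket would mean more than one Heartbeat instance is reporting it:

  GET heartbeat_index/_search
  {
    "size": 0,
    "query": { "term": { "monitor.id": "sampleid1" } },
    "aggs": {
      "reporting_agents": { "terms": { "field": "agent.id" } }
    }
  }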