Heartbeat: Ping Timeout on hosts causes state.duration_ms to stay at 0?

We are monitoring a few hosts with ICMP in Heartbeat. Heartbeat accurately reports uptime via the state.duration_ms field. However, shutting down a host for testing produces an error.message of "ping timeout", which is expected, but whenever that error occurs, state.duration_ms stays at 0. The state.id also changes on every event, so all of the state subfields are constantly reset and never track from when the host first went down. We want to use state.duration_ms to measure how long the host has been down, but we can't, because Heartbeat treats each ping timeout as a new state and the duration never accumulates. Is there anything we can do to achieve this?
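For illustration, here is what I expected consecutive failed checks to look like (a simplified, hand-written sketch, not real events): the same state.id carried across events, with state.checks incrementing and state.duration_ms growing.

First down event:

  {
    "state": {
      "id": "default-1884e0759dc-6e52",
      "status": "down",
      "checks": 1,
      "duration_ms": "0",
      "started_at": "2023-05-24T13:51:24.636Z"
    }
  }

Next down event, one schedule interval (10s) later:

  {
    "state": {
      "id": "default-1884e0759dc-6e52",
      "status": "down",
      "checks": 2,
      "duration_ms": "10000",
      "started_at": "2023-05-24T13:51:24.636Z"
    }
  }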

Hi @scantron, sorry to hear you're having issues using Heartbeat.

This is a feature that has undergone numerous iterations throughout v8.x. Could you please let me know what version of the Elastic Stack you are running?

I am running the latest version, 8.7.

Thanks for sharing that - so far I haven't been able to reproduce this issue on my side.

Some additional things that could help diagnose the issue:

  • Are you able to include any config for your monitor?
  • It is odd that you're seeing a different state ID on each ping. Are you using run once mode?

It may also be helpful to take a look at the Heartbeat logs: are any errors showing up there?
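For reference, run-once mode is the top-level heartbeat.run_once setting in heartbeat.yml; a minimal sketch of what enabling it would look like (the monitor values here are placeholders):

  heartbeat.run_once: true
  heartbeat.monitors:
    - type: icmp
      id: sampleid123
      hosts: ["the-problem-host"]
      schedule: '@every 10s'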

Hi J, thanks for your response. Here is a sample config for the host that is having this problem.

Under the monitors.d folder, we have this config enabled:

  type: icmp
  id: sampleid123
  name: samplename123
  enabled: true
  schedule: '@every 10s'
  hosts: [
   "the-problem-host"
  ]
  ipv6: true
  mode: any
  timeout: 5s
  wait: 1s
  tags: ["tag1"]
  fields_under_root: true
  fields:
    field1: value1

Names and fields have been swapped out, but all options are the same.

Here are two sample logs that demonstrate this behavior in JSON format:

Log 1

{
  "_index": "heartbeat_index",
  "_id": "hCwHTogB0CLBbzCvXggs",
  "_version": 1,
  "_score": null,
  "_source": {
    "error": {
      "type": "io",
      "message": "ping timeout"
    },
    "hostname": "the-problem-host",
    "state": {
      "down": 1,
      "duration_ms": "0",
      "up": 0,
      "ends": {
        "status": "up",
        "duration_ms": "0",
        "up": 1,
        "ends": null,
        "down": 0,
        "checks": 1,
        "flap_history": null,
        "id": "default-1884e074654-6e51",
        "started_at": "2023-05-24T13:51:19.636833008Z"
      },
      "status": "down",
      "id": "default-1884e0759dc-6e52",
      "flap_history": [],
      "started_at": "2023-05-24T13:51:24.636739175Z",
      "checks": 1
    },
    "summary": {
      "up": 0,
      "down": 1
    },
    "monitor": {
      "duration": {
        "us": 5002514
      },
      "name": "samplename1",
      "timespan": {
        "gte": "2023-05-24T13:51:24.636Z",
        "lt": "2023-05-24T13:51:34.636Z"
      },
      "check_group": "0cb1a619-fa3a-11ed-a6b6-00163e77ee12",
      "status": "down",
      "ip": "redacted",
      "type": "icmp",
      "id": "sampleid1"
    },
    "field1":"value1",
    "@version": "1",
    "agent": {
      "type": "heartbeat",
      "ephemeral_id": "4773b85c-53b9-43af-b353-1777ec80b0d9",
      "id": "c51f1f5c-afbf-458b-a0aa-feb5ae48312b",
      "name": "heartbeat_instance",
      "version": "8.7.0"
    }
  },
  "fields": {
    "@timestamp": [
      "2023-05-24T13:51:19.634Z"
    ],
    "monitor.timespan.lt": [
      "2023-05-24T13:51:34.636Z"
    ],
    "state.ends.started_at": [
      "2023-05-24T13:51:19.636Z"
    ],
    "state.started_at": [
      "2023-05-24T13:51:24.636Z"
    ],
    "monitor.timespan.gte": [
      "2023-05-24T13:51:24.636Z"
    ]
  }
}

Log 2

{
  "_index": "heartbeat_index",
  "_id": "rSwHTogB0CLBbzCvbQh_",
  "_version": 1,
  "_score": null,
  "_source": {
    "error": {
      "type": "io",
      "message": "ping timeout"
    },
    "hostname": "the-problem-host",
    "state": {
      "checks": 1,
      "duration_ms": "0",
      "up": 0,
      "ends": {
        "status": "up",
        "duration_ms": "0",
        "up": 1,
        "ends": null,
        "down": 0,
        "checks": 1,
        "flap_history": null,
        "id": "default-1884e0755a7-6cbf",
        "started_at": "2023-05-24T13:51:23.559937829Z"
      },
      "down": 1,
      "id": "default-1884e07692f-6cc0",
      "flap_history": [],
      "started_at": "2023-05-24T13:51:28.559665451Z",
      "status": "down"
    },
    "summary": {
      "up": 0,
      "down": 1
    },
    "monitor": {
      "duration": {
        "us": 5001712
      },
      "name": "samplename1",
      "timespan": {
        "gte": "2023-05-24T13:51:28.559Z",
        "lt": "2023-05-24T13:51:38.559Z"
      },
      "check_group": "0f0814db-fa3a-11ed-bc8d-00163eee7ae2",
      "type": "icmp",
      "ip": "redacted",
      "id": "sampleid1",
      "status": "down"
    },
    "field1":"value1",
    "@version": "1",
    "agent": {
      "type": "heartbeat",
      "ephemeral_id": "a68c31e5-eba7-4944-b88d-d5020917b488",
      "id": "422d8f5a-bbe0-4a7b-92d9-542ef9192560",
      "name": "heartbeat_instance",
      "version": "8.7.0"
    }
  },
  "fields": {
    "@timestamp": [
      "2023-05-24T13:51:23.557Z"
    ],
    "monitor.timespan.lt": [
      "2023-05-24T13:51:38.559Z"
    ],
    "state.ends.started_at": [
      "2023-05-24T13:51:23.559Z"
    ],
    "state.started_at": [
      "2023-05-24T13:51:28.559Z"
    ],
    "monitor.timespan.gte": [
      "2023-05-24T13:51:28.559Z"
    ]
  }
}

These logs were produced one right after the other, so the duration that the host has been down should accumulate, but you can see that the state ID changes and all of the state subfields reset with it. Any irrelevant or personal fields have been deleted or redacted.

The heartbeat.yml config file has no special settings. The output is Kafka, and it contains only a couple of processors that copy field values into different field names.
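For reference, that kind of setup in heartbeat.yml would look roughly like the following; the broker address, topic, and field names are placeholders rather than the real values:

  output.kafka:
    hosts: ["kafka-broker:9092"]
    topic: "heartbeat"

  processors:
    # copy_fields duplicates a value into another field name
    - copy_fields:
        fields:
          - from: monitor.status
            to: check.status
        fail_on_error: false
        ignore_missing: true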

bump

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Hi @scantron,

Something that caught my attention about the documents you provided:

  • In the example <Log 1>, a new down state with id: default-1884e0759dc-6e52 is ending a previous up state with id: default-1884e074654-6e51.
  • Whereas in <Log 2>, the new down state with id: default-1884e07692f-6cc0 is ending a previous up state with id: default-1884e0755a7-6cbf.

So there are definitely two distinct up states at the time these checks ran. I then noticed that the two documents carry different agent IDs, as if they were produced by different Heartbeat instances:

  • <Log 1>
    "agent": {
      "type": "heartbeat",
      "ephemeral_id": "4773b85c-53b9-43af-b353-1777ec80b0d9",
      "id": "c51f1f5c-afbf-458b-a0aa-feb5ae48312b",
  • <Log 2>
    "agent": {
      "type": "heartbeat",
      "ephemeral_id": "a68c31e5-eba7-4944-b88d-d5020917b488",
      "id": "422d8f5a-bbe0-4a7b-92d9-542ef9192560",

Can you tell us a bit more about how these are set up? If you're running two or more instances of Heartbeat with the same monitor config, you'll get a different state ID per instance, since state is tracked locally by each instance.
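One quick way to check, assuming the events end up in an Elasticsearch index you can query (the index name and monitor ID below are taken from your sample documents; adjust them as needed), is a terms aggregation on agent.id for the affected monitor. More than one bucket would mean more than one Heartbeat instance is reporting it:

  GET heartbeat_index/_search
  {
    "size": 0,
    "query": { "term": { "monitor.id": "sampleid1" } },
    "aggs": {
      "reporting_agents": { "terms": { "field": "agent.id" } }
    }
  }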