Specifying Multiple Success Status Codes for Heartbeat Monitor

TL;DR

How do I specify multiple success response codes for an Elastic Heartbeat monitor?

Details

We have an application for which http.response.code: 200 and http.response.code: 403 are both considered successful.

However, in the heartbeat.yml file, it's only possible to specify a single success code for check.response.status.

This response to another question references using processors for this purpose, but there appear to be two issues with this:

  1. Processors don't appear to be intended for this purpose. According to the processor documentation:

    You can use processors to filter and enhance data before sending it to the configured output.

  2. It doesn't work when I specify it as below.

I've attached my /etc/heartbeat/monitors.d/my-app.http.yml below.

What am I missing? Is there detailed documentation for the check.response... section of the YAML?

- type: http # monitor type `http`. Connect via HTTP and optionally verify response

  # Monitor name used for job name and document type
  name: my-app

  # Enable/Disable monitor
  enabled: true

  # Configure task schedule
  schedule: '@every 30s' # every 30 seconds from start of beat

  # Configure URLs to ping
  urls:
    - http://my-app.example.com

  # Configure IP protocol types to ping on if hostnames are configured.
  # Ping all resolvable IPs if `mode` is `all`, or only one IP if `mode` is `any`.
  ipv4: true
  ipv6: true
  mode: any
  
  # Expected response settings
  # check:
  #   response:
  #     # Expected status code. If not configured or set to 0 any status code not
  #     # being 404 is accepted.
  #     status: 200

  processors:
    - or:
      - equals:
          http.response.code: 200

      - equals:
          http.response.code: 403

I updated to the following for my-app.http.yml. This now reports a monitor status of up for http.response.status_code of 200 or 403.

- type: http # monitor type `http`. Connect via HTTP and optionally verify response

  # Monitor name used for job name and document type
  name: my-app

  # Enable/Disable monitor
  enabled: true

  # Configure task schedule
  schedule: '@every 30s' # every 30 seconds from start of beat

  # Configure URLs to ping
  urls:
    - https://my-app.example.com

  # Configure IP protocol types to ping on if hostnames are configured.
  # Ping all resolvable IPs if `mode` is `all`, or only one IP if `mode` is `any`.
  ipv4: true
  ipv6: true
  mode: any

  # Success on HTTP Response 200/OK.
  check:
    response:
      status: 200

  # Return success (monitor.status: up) for specified additional HTTP response codes.
  # If multiple add'l success codes, use an `or` within the `if` condition.
  processors:
    - if:
        equals:
          http.response.status_code: 403

      then:
        - drop_fields:
            fields:
              - error
              - monitor.status

        - add_fields:
            target: monitor
            fields:
              status: up
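
The comment above about using an `or` within the `if` condition could look roughly like the sketch below. The second code, 401, is purely a hypothetical placeholder for illustration, and I'm assuming `or` nests under `if` the same way it does in other Beats processor conditions; I haven't verified this on every version. It is still meant to sit alongside check.response.status: 200 inside the monitor.

```yaml
processors:
  - if:
      or:
        - equals:
            http.response.status_code: 403
        # 401 is a hypothetical extra success code, not from the original config.
        - equals:
            http.response.status_code: 401
    then:
      # Remove the failure markers set by the status check...
      - drop_fields:
          fields:
            - error
            - monitor.status
      # ...and report the monitor as up instead.
      - add_fields:
          target: monitor
          fields:
            status: up
```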

Hmmmm, the first approach should have worked. I'll look into it today. There's a chance we're internally not casting the numeric type right.

Thanks for the report!

Thanks.

The alternative that I worked out works just fine for me. What bugs me more about this is that I accidentally grabbed the filebeat processor reference, rather than the heartbeat processor reference, so I initially wrote the following processor:

processors:
  - if:
      equals:
        http.response.status_code: 403

    then:
      - drop_fields:
          ignore_missing: true
          fields:
            - error
            - monitor.status

      - add_fields:
          target: monitor
          fields:
            status: up

...which would've allowed me to dispense with the check.response.status: 200. However, I got the following error:

Sep 04 15:30:15 host001 heartbeat[38354]: 2019-09-04T15:30:15.981-0500        ERROR        monitors/monitor.go:221        Failed to load monitor processors: could not load monitor processors: failed to make if/then/else processor: unexpected ignore_missing option in processors.0.then.0.drop_fields

When I dove into the correct documentation (heartbeat, instead of filebeat), it appears that the heartbeat drop_fields processor does not include the ignore_missing: option. I'm at a loss to explain why, as it would be just as handy in the heartbeat as the filebeat.

Not to mention, as a matter of preference, it would be nice to have an update_fields processor rather than have to do a drop/add on the field. :smile:

Are you sure they're the same version? The processors for heartbeat and filebeat are the exact same code.

Hmm. That may be it. It looks as if I was using the "master" documentation for the filebeat reference and the "current" (7.3.1) documentation for the heartbeat reference; 7.3.1 is the version of heartbeat I'm running.

That makes more sense.

I have to apologize, I misread your original post. That shouldn't work because it's just a boolean expression without an action. So your fix does make sense.

That said, you may find in the future that you want to scope your logic to just one monitor, which the global processors section can't do easily. In that case, check out per-monitor processors, where you can specify the logic per monitor.

Can I ask why you are globally considering 403 statuses as successes? Another approach here would be to set check.response.status to 403 for the services that are expected to return that. Generally a service should return exactly one correct status code.
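
As a minimal sketch of that alternative, a monitor for a service that is expected to return 403 might look like the following. The name and URL are simply reused from the config earlier in this thread; the rest of the monitor's settings are omitted.

```yaml
- type: http
  name: my-app
  schedule: '@every 30s'
  urls:
    - https://my-app.example.com
  check:
    response:
      # Treat 403 as the single expected status for this monitor.
      status: 403
```

With this, no processors are needed, but a 200 from the same service would no longer match the expected status.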

I'll also add that one problem with making this change in processors is that it will break some of our calculations for the summary.up/down fields, which are required for multi-location support, among other things, in the Uptime Kibana app. It's a bit complicated, but you can see this GH issue for more details.

They're not being considered a success globally. This is actually for that specific monitor, if you look at the indentation level for the initial monitor again. This is configured only for the my-app monitor. As for the 403 being considered a success for this application, the tl;dr is "because that's what the application team requested." The longer answer is that the application team periodically expects a 403 and, in this case, it doesn't mean that the application isn't functioning correctly.

I'll take a look at this as well. However, I took a deep dive into what they're actually doing, and there's a better way to do it that doesn't involve a multi-success code scenario, which is well within Heartbeat's capabilities. It was a typical XY problem, and I didn't ask the correct questions going in.

Thanks for prompting me to think along those lines with your earlier question about why we're considering 403 status as a success.

Thanks for the feedback here. It'd be great to hear what you find. I'm actually coming to the opposite conclusion. If it's useful to our users, such as yourself, to match multiple response codes, maybe we should make that easier and let status code take an array.
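
To make the idea concrete, the array form might look something like the fragment below. This is only a sketch of the proposed syntax: as noted at the top of this thread, heartbeat.yml currently accepts a single code here, so don't assume this works on your version.

```yaml
check:
  response:
    # Proposed syntax only: a list of status codes that all count as success.
    status: [200, 403]
```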

The reason we haven't done that so far is that it is a bit of an anti-pattern, in that services should really return one response code. That said, that's not very useful to an operator dealing with a service they may have little control over, whether it belongs to another team, a vendor, etc.

I've opened this GH issue to track this https://github.com/elastic/beats/issues/13595

That said, if you can meet with the app team and get them to return a single code, that's probably still the best solution :slight_smile: