Multiple hosts per monitor confuse certificate validation since 7.17.0

Since upgrading Heartbeat to 7.17.0 all my monitors have gone haywire.

It seems that if you define multiple hosts per monitor like so:

- type: http
  id: cicd
  name: CI/CD
  hosts:
    - https://a.corp.internal
    - https://b.corp.internal
    - https://c.corp.internal
  check.response.status: [200]
  max_redirects: 3
  schedule: '@every 30s'

you will get errors like:

io:Get "https://a.corp.internal": x509: certificate is valid for a.corp.internal, not b.corp.internal

This happens for each monitor where there are multiple hosts, with differing FQDNs.
Monitors with a single host, or monitors with multiple hosts on the same FQDN, don't have this issue.

My setup was working fine for months until I updated from 7.16.3 to 7.17.0 today.

Update: couldn't find the cause of the issue. Reverting back to 7.16.3 worked.

Sorry to hear you're hitting this issue. It's a tricky one to debug because I'm having trouble replicating it. I tried to do so with the following config:

- type: http
  id: cicd
  name: CI/CD
  hosts:
    - https://elastic.co
    - https://google.com
  check.response.status: [200]
  max_redirects: 3
  schedule: '@every 30s'

However, that all seemed to work.

Can you replicate this behavior against any public sites so that we could reproduce it? The strange thing here is that 7.17.0 doesn't contain any changes that should impact this AFAIK.

Reading through the error, it sounds like Heartbeat is somehow mixing up the cert for one endpoint with that of another. However, I would have expected my replication attempt to reveal the same issue.

I'll need some time to create a test setup to reproduce it. This was our production system which I had to roll back. I'll get back to you next week.

I have been able to reproduce the exact same behavior with a brand new setup. Different host, brand new containers, volumes, etc.

- type: http
  id: cicd
  name: CI/CD
  hosts:
    - https://bitbucket.corp.internal/status
    - https://jira.corp.internal/status
  check.response.status: [200]
  max_redirects: 3
  schedule: '@every 30s'
  ssl:
    certificate_authorities:
    - /etc/pki/ca-trust/source/anchors/corp-ca-bundle.pem
    supported_protocols:
    - TLSv1.1
    - TLSv1.2

All it takes to trigger the behavior is at least two FQDNs (in the same monitor) that share a domain+TLD but have their own (non-wildcard) certificates.
The reports will alternate between:

io:Get "https://jira.corp.internal/status": x509: certificate is valid for jira.corp.internal, not bitbucket.corp.internal

and

io:Get "https://bitbucket.corp.internal/status": x509: certificate is valid for bitbucket.corp.internal, not jira.corp.internal

So you are right, Heartbeat (or Elastic) is mixing up the certs between endpoints. And it's not constant either: the two errors alternate with no pattern I can identify.

It's very hard for me to provide an example with open-internet URLs, since my corporate network has a TLS interceptor, which obfuscates a lot.

But perhaps this would work for you (provided both have their own certificates, not shared wildcard cert).

  hosts:
    - https://discuss.elastic.co/
    - https://community.elastic.co/

As I said, they have to share the same domain+TLD for the problem to occur.
Using heartbeat 7.16.3 instead of 7.17.0 immediately fixes the issue (with the Elasticsearch version being constant at 7.17.0).
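For anyone who wants to try the public-host suggestion above, here is a sketch of a complete monitor built from that fragment and the settings used elsewhere in this thread. It is untested: the id and name are made up for illustration, and whether these two hosts actually serve separate non-wildcard certificates is an assumption that has not been checked.

```yaml
# Hypothetical reproduction monitor; id and name are placeholders.
- type: http
  id: cert-mixup-repro
  name: Cert mix-up repro
  hosts:
    # Two FQDNs sharing the same domain+TLD (assumed to use separate certs).
    - https://discuss.elastic.co/
    - https://community.elastic.co/
  check.response.status: [200]
  max_redirects: 3
  schedule: '@every 30s'
```

No ssl block is included because public sites should validate against the system trust store.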

@PayBas I managed to reproduce the issue as you described. It looks to be a very specific edge case involving domains with a common suffix and non-wildcard certificates.

I found two workarounds:

  1. Set ssl.verification_mode: certificate
- type: http
  id: cicd
  name: CI/CD
  enabled: true
  urls:
    - https://bitbucket.corp.internal
    - https://jira.corp.internal
  check.response.status: [200]
  max_redirects: 3
  schedule: '@every 5s'
  ssl:
    certificate_authorities:
      - /home/tiago/devel/certGen/rootCA.crt
    verification_mode: certificate
  2. Use different monitors for hosts that share a suffix and use non-wildcard certificates (in other words: hosts that cannot share a TLS certificate)
- type: http
  id: jira
  name: Jira
  enabled: true
  urls:
    - https://jira.corp.internal
  check.response.status: [200]
  max_redirects: 3
  schedule: '@every 5s'
  ssl:
    certificate_authorities:
      - /home/tiago/devel/certGen/rootCA.crt

- type: http
  id: bitbucket
  name: Bitbucket
  enabled: true
  urls:
    - https://bitbucket.corp.internal
  check.response.status: [200]
  max_redirects: 3
  schedule: '@every 5s'
  ssl:
    certificate_authorities:
      - /home/tiago/devel/certGen/rootCA.crt
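As a sketch of what workaround 1 might look like when applied to the multi-host reproduction monitor posted earlier in this thread (same hosts, CA bundle path, and protocols as that post; the only addition is the verification_mode line), assuming nothing else in the config needs to change:

```yaml
- type: http
  id: cicd
  name: CI/CD
  hosts:
    - https://bitbucket.corp.internal/status
    - https://jira.corp.internal/status
  check.response.status: [200]
  max_redirects: 3
  schedule: '@every 30s'
  ssl:
    certificate_authorities:
      - /etc/pki/ca-trust/source/anchors/corp-ca-bundle.pem
    supported_protocols:
      - TLSv1.1
      - TLSv1.2
    # Workaround 1: still checks the CA chain, but skips hostname verification.
    verification_mode: certificate
```

Keep in mind the trade-off: in certificate mode the certificate must still be signed by a trusted CA, but the hostname is no longer checked against it, so a certificate issued for a different host by the same CA would be accepted.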

@TiagoQueiroz thank you for confirming my issue.

Workaround 2 (splitting monitors) is not practical for me. This would make my monitor files hundreds of lines long :). But I'll definitely look into ssl.verification_mode: certificate.

Otherwise I'll stick with 7.16.3 for now. The fact that I appear to be the only one running into this 7.17.0 bug suggests that it is indeed an edge case (one which hopefully will be addressed in a future release).

My question now is: do you guys create a ticket at https://github.com/elastic/beats/issues?q=is%3Aopen+is%3Aissue+label%3AHeartbeat or should I?

I've opened an issue at https://github.com/elastic/beats/issues/30290 with additional thoughts. Let's continue the conversation there. Thanks @PayBas and @TiagoQueiroz for investigating.

@PayBas Here is the fix: https://github.com/elastic/beats/pull/30305

In the end it was a race condition. It seems to happen only when the hosts cannot share certs and the reply is very quick.
