ECE Unhealthy platform, internal server error

A quick update: having determined that the AC cluster needed capacity to restart as HA, and that the new host I added had not successfully taken on its roles, I decided to terminate the new host and try again with a fresh one - and a fresh token. This time docker ps showed plenty of activity. Automating this to reliably replace terminated hosts will be a focus now.
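For the record, the manual version of what I did looks roughly like this (a sketch - the enrollment-tokens endpoint and installer flags are the standard documented ones, but adjust the roles, hosts and credentials to your setup):

# on an existing host, mint a short-lived token for the roles the lost host carried
curl -u admin:$PASSWORD -X POST "localhost:12400/api/v1/platform/configuration/security/enrollment-tokens" \
  -H 'Content-Type: application/json' \
  -d '{"persistent": false, "roles": ["allocator"]}'
# then on the fresh host, install ECE pointing at an existing coordinator
bash elastic-cloud-enterprise.sh install \
  --coordinator-host <existing-coordinator-ip> \
  --roles-token '<token-from-above>' \
  --roles "allocator" \
  --availability-zone <zone-of-the-lost-host>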

I then tried to restart the AC cluster, and this time the failure was licensing-related, not capacity - I'm aware the trial license has recently run out.

      "message": "Starting step: [check-enterprise-license]: [{\"adding_instances\":true,\"resize_up_instances\":false,\"is_major_version_change\":false,\"trial_duration\":\"30 days\"}]",
      "message": "Unexpected error during step: [check-enterprise-license]: [no.found.constructor.steps.licensing.EnterpriseLicenseInvalidException: Abort upgrade: [No valid enterprise license: [out_of_date]

I guess replacing lost capacity counts as adding instances

Yep - because the AC cluster was shut down, starting it up again counts as adding new capacity (which triggers a check against the license)

(If you contact someone, I'm sure you can get a trial extension to finish getting up and running)

Automating this to reliably replace terminated hosts will be a focus now.

Incidentally if you haven't done so I recommend creating a persistent: true token for all the roles and storing it somewhere secure, in case someone accidentally removes all the coordinator roles.

(The system spits out an emergency token for just this on install, so you may already have such a thing)
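If you don't have one, creating it looks something like this (a sketch against the same adminconsole API used elsewhere in this thread; pick whatever roles you want the token to be able to enrol):

curl -u admin:$PASSWORD -X POST "localhost:12400/api/v1/platform/configuration/security/enrollment-tokens" \
  -H 'Content-Type: application/json' \
  -d '{"persistent": true, "roles": ["director", "coordinator", "proxy", "allocator"]}'
# store the returned token somewhere secure (a vault), not on the ECE hosts themselves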

OK - some time later I have applied a license extension using the API (the UI is still fubar).
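For anyone else who hits this, the call was roughly as follows (paraphrased from the ECE licensing API docs, so double-check the endpoint against your version; new-license.json is the license JSON Elastic sent, wrapped as {"license": { ... }}):

curl -u admin:$PASSWORD -X PUT "localhost:12400/api/v1/platform/license" \
  -H 'Content-Type: application/json' \
  -d @new-license.json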

I removed and re-added the additional capacity, which was eventually successful.

I used docker to restart the adminconsole containers, then used the API to _stop and _restart the adminconsole cluster.

I now see different problems in the UI: a _search request times out when trying to view the deployments, and ece-region stays pending for a long time and then times out.

Using the get cluster API the adminconsole cluster claims to be healthy.

Checking the logs for the adminconsole container I can see current errors as follows:

[2020-03-17 11:27:40,600][INFO ][no.found.adminconsole.elasticsearch.IndexConfigurationActor] Waiting for admin cluster to become reachable ([Unexpected status code [503] received from ElasticSearch: [

503 Service Unavailable

No server is available to handle this request. ]]). Retrying every [5 seconds]. {} co.elastic.cloud.infrastructure.elasticsearch.spray.SprayElasticsearchClient$UnexpectedStatusException: Unexpected status code [503] received from ElasticSearch: [

503 Service Unavailable

No server is available to handle this request. ]

@Alex_Piggott do you have any suggestions for further diagnosis or repair?

So I see two things here:

1] Unless you were snapshotting the AC cluster, you can't just _stop and _restart it, because of the rather basic way in which we load templates and mappings

What you have to do is:

  • _stop the AC cluster
  • restart the ACs from docker (docker restart frc-admin-consoles-admin-console)
  • Now you can _restart the AC cluster (full sequence sketched below)
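Put together, the sequence is roughly this (a sketch using the same adminconsole API as earlier in the thread; $AC_CLUSTER_ID is the admin-console cluster id, and the docker restart has to be run on every host carrying the admin-console role):

curl -u admin:$PASSWORD -X POST "localhost:12400/api/v1/clusters/elasticsearch/$AC_CLUSTER_ID/_stop"
docker restart frc-admin-consoles-admin-console
curl -u admin:$PASSWORD -X POST "localhost:12400/api/v1/clusters/elasticsearch/$AC_CLUSTER_ID/_restart"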

2] That said, if you are getting 503s that's a bit surprising and suggests something else has gone wrong

Can you hit the AC cluster directly? Eg do you get a 40x if you just do curl '$AC_CLUSTER_ID.$ROOT_URL:9200'?

If it fails, what does curl -u admin:$PASSWORD "localhost:12400/api/v1/clusters/elasticsearch/$AC_CLUSTER_ID" return?

Alex,

Yes, I can hit the AC cluster directly; I used the endpoint from the GET clusters API so didn't need the :9200.

Could ECE have lost track of where to find the AC cluster somehow? There seems to be a mismatch - some indications that it's running fine and others that it's unable to serve.
The docker logs for the adminconsole service on the host that the load balancer points at still show the same current errors about waiting for the admin cluster to become reachable...

Unless you deleted the cluster as well as stopping it (doesn't sound like that's the case), it won't have lost track of which id it has.

On the AC where it's failing, try curl -H "X-Found-Cluster: $AC_CLUSTER_ID" localhost:9244 ... that's how the AC connects to it.

If that fails, try docker restart frc-services-forwarders-services-forwarder (I think that's right - if not it will be obvious which one from docker ps).

What does this mean? Which API endpoint did you use exactly, and what reply did you get?

That's a good question.
In the GET clusters output I was looking for admin-console-related things and found...
"cluster_name": "admin-console-elasticsearch",
...
It has cluster_id and deployment_id, which are identical, and also
"metadata": {
"cloud_id": "admin-console-elasticsearch:",
"endpoint": "70bceb62fa7f480395ca2cb9d6573fa3.",

In AWS our domain has a wildcard record backed by a load balancer, whose target group listens on port 80 and forwards to 12400.
So I was really hitting 12400 rather than 9200.
There isn't a mapping for :9200 currently.
(:12400 gave a login page.)

Ah - maybe some progress.
curl -H "X-Found-Cluster: 70bceb62fa7f480395ca2cb9d6573fa3" localhost:9244

503 Service Unavailable

No server is available to handle this request.

This corresponds to the errors I was seeing in the logs for AC...

I then restarted the frc-services-forwarders-services-forwarder...

I still get the 503 error

For the record here is the GET clusters response for the adminconsole cluster ID

{
  "associated_apm_clusters": [ ],
  "associated_appsearch_clusters": [],
  "associated_kibana_clusters": [],
  "cluster_id": "70bceb62fa7f480395ca2cb9d6573fa3",
  "cluster_name": "admin-console-elasticsearch",
  "deployment_id": "70bceb62fa7f480395ca2cb9d6573fa3",
  "elasticsearch": {
    "blocking_issues": {
      "cluster_level": [],
      "healthy": true,
      "index_level": []
    },
    "healthy": true,
    "master_info": {
      "healthy": true,
      "instances_with_no_master": [],
      "masters": [{
        "instances": ["instance-0000000013", "instance-0000000014", "instance-0000000015"],
        "master_instance_name": "instance-0000000013",
        "master_node_id": "n_-MXOyrSyiQj3HeXBjh6g"
      }]
    },
    "shard_info": {
      "available_shards": [{
        "instance_name": "instance-0000000013",
        "shard_count": 0
      }, {
        "instance_name": "instance-0000000014",
        "shard_count": 0
      }, {
        "instance_name": "instance-0000000015",
        "shard_count": 0
      }],
      "healthy": true,
      "unavailable_replicas": [{
        "instance_name": "instance-0000000013",
        "replica_count": 0
      }, {
        "instance_name": "instance-0000000014",
        "replica_count": 0
      }, {
        "instance_name": "instance-0000000015",
        "replica_count": 0
      }],
      "unavailable_shards": [{
        "instance_name": "instance-0000000013",
        "shard_count": 0
      }, {
        "instance_name": "instance-0000000014",
        "shard_count": 0
      }, {
        "instance_name": "instance-0000000015",
        "shard_count": 0
      }]
    }
  },
  "external_links": [{
    "id": "cluster-logs",
    "label": "Elasticsearch Logs",
    "uri": "https://7de7a663825b45fca47d5e4bec23f092.<OUR_DOMAIN>:9243/app/kibana#/discover?_a=(columns:!(message),index:'cluster-logs-*',query:(query_string:(query:'ece.cluster:%2270bceb62fa7f480395ca2cb9d6573fa3%22')))&_g=(time:(from:now-1h,mode:quick,to:now))"
  }, {
    "id": "metricbeat",
    "label": "Elasticsearch Metrics",
    "uri": "https://7de7a663825b45fca47d5e4bec23f092.<OUR_DOMAIN>:9243/app/kibana#/dashboard/AV4REOpp5NkDleZmzKkE?_a=(filters:!(('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'metricbeat-*',key:ece.cluster,negate:!f,params:(query:'70bceb62fa7f480395ca2cb9d6573fa3',type:phrase),type:phrase,value:'70bceb62fa7f480395ca2cb9d6573fa3'),query:(match:(ece.cluster:(query:'70bceb62fa7f480395ca2cb9d6573fa3',type:phrase))))))&_g=(time:(from:now-1h,mode:quick,to:now))"
  }, {
    "id": "proxy-logs",
    "label": "Proxy logs",
    "uri": "https://7de7a663825b45fca47d5e4bec23f092.<OUR_DOMAIN>:9243/app/kibana#/discover?_a=(columns:!(status_code,request_method,request_path,request_length,response_length,response_time),index:'proxy-logs-*',query:(query_string:(query:'handling_cluster:%2270bceb62fa7f480395ca2cb9d6573fa3%22')))&_g=(time:(from:now-1h,mode:quick,to:now))"
  }],
  "healthy": true,
  "links": {

  },
  "metadata": {
    "cloud_id": "admin-console-elasticsearch:<credentials>",
    "endpoint": "70bceb62fa7f480395ca2cb9d6573fa3.<OUR_DOMAIN>",
    "last_modified": "2020-03-17T13:36:09.505Z",
    "ports": {
      "http": 9200,
      "https": 443
    },
    "version": 38
  },
  "plan_info": {
    "current": {
      "attempt_end_time": "2020-03-17T13:35:18.153Z",
      "attempt_start_time": "2020-03-17T13:34:35.247Z",
      "healthy": true,
      "plan": {
        "cluster_topology": [{
          "elasticsearch": {
            "enabled_built_in_plugins": [],
            "user_bundles": [],
            "user_plugins": [],
            "user_settings_json": {

            }
          },
          "instance_configuration_id": "data.default",
          "memory_per_node": 4096,
          "node_count_per_zone": 1,
          "node_type": {
            "data": true,
            "ingest": true,
            "master": true,
            "ml": false
          },
          "zone_count": 3
        }],
        "deployment_template": {
          "id": "default"
        },
        "elasticsearch": {
          "version": "6.8.4"
        }
      },
      "plan_attempt_id": "5a3c210c-1298-4fc1-b35b-71d2469acfc1",
      "plan_attempt_log": [],
      "source": {
        "action": "elasticsearch.restart-cluster",
        "admin_id": "admin",
        "date": "2020-03-17T13:34:35.169Z",
        "facilitator": "adminconsole",
        "remote_addresses": ["0:0:0:0:0:0:0:1"]
      }
    },
    "healthy": true,
    "history": []
  },
  "snapshots": {
    "count": 0,
    "healthy": true,
    "recent_success": false
  },
  "status": "started",
  "topology": {
    "healthy": true,
    "instances": [{
      "allocator_id": "10.5.3.153",
      "container_started": true,
      "healthy": true,
      "instance_configuration": {
        "id": "data.default",
        "name": "data.default",
        "resource": "memory"
      },
      "instance_name": "instance-0000000015",
      "maintenance_mode": false,
      "memory": {
        "instance_capacity": 4096
      },
      "service_id": "YVewIqo7QIKQ1c8GX3Hjlg",
      "service_roles": ["data", "master", "ingest"],
      "service_running": true,
      "service_version": "6.8.4",
      "zone": "MY_ZONE-3"
    }, {
      "allocator_id": "10.5.2.152",
      "container_started": true,
      "healthy": true,
      "instance_configuration": {
        "id": "data.default",
        "name": "data.default",
        "resource": "memory"
      },
      "instance_name": "instance-0000000014",
      "maintenance_mode": false,
      "memory": {
        "instance_capacity": 4096
      },
      "service_id": "tfB7bW_8S6yGirRuhtXLeg",
      "service_roles": ["data", "master", "ingest"],
      "service_running": true,
      "service_version": "6.8.4",
      "zone": "MY_ZONE-2"
    }, {
      "allocator_id": "10.5.6.186",
      "container_started": true,
      "healthy": true,
      "instance_configuration": {
        "id": "data.default",
        "name": "data.default",
        "resource": "memory"
      },
      "instance_name": "instance-0000000013",
      "maintenance_mode": false,
      "memory": {
        "instance_capacity": 4096
      },
      "service_id": "n_-MXOyrSyiQj3HeXBjh6g",
      "service_roles": ["data", "master", "ingest"],
      "service_running": true,
      "service_version": "6.8.4",
      "zone": "MY_ZONE-1"
    }]
  }
}

Wow this is an interesting one

So we saw that curl -H "X-Found-Cluster: $AC_CLUSTER_ID" localhost:9244 failed

That's despite the AC cluster showing as 100% healthy, and the issue isn't with the services forwarder (or at least a restart didn't fix it!). So ....

If you go to one of the proxy hosts and run curl -H "X-Found-Cluster: $AC_CLUSTER_ID" localhost:9200, do you also get 503? If so, can you docker restart frc-proxies-proxy (again double check I got the exact right name)

Yes, the X-Found-Cluster request to localhost:9244 failed. This is on the server that is running all roles and is the first in line for the load balancer hits.

I did restart the services forwarder on that same server and saw no change.

We don't have separate proxy hosts, but the proxy runs on the same server so I tried the same request to localhost:9200 and got a different response - 401 - it requires authentication.

I don't think I have the credentials for the admin console cluster? I tried the ECE admin user/password but that wasn't it.

curl -H "X-Found-Cluster: 70bceb62fa7f480395ca2cb9d6573fa3" localhost:9200

{"error":{"root_cause":[{"type":"security_exception","reason":"action [cluster:monitor/main] requires authentication","header":{"WWW-Authenticate":["Bearer realm=\"security
\"","ApiKey","Basic realm=\"security\" charset=\"UTF-8\""]}}],"type":"security_exception","reason":"action [cluster:monitor/main] requires authentication","header":{"WWW-Au
thenticate":["Bearer realm=\"security\"","ApiKey","Basic realm=\"security\" charset=\"UTF-8\""]}},"status":401}

The point was if you get the 401 you've actually hit the cluster

This is very odd because the AC -> AC cluster path is:

AC -> 9244 (services forwarder) -> 9243 (proxy) -> 9200 (proxy) -> AC cluster instance

So we've now established that 9200-> works, 9244-> doesn't work

So I suppose for completeness you should do curl -k "X-Found-Cluster: $AC_CLUSTER_ID" localhost:9243 and assuming that works, we've isolated the problem to the services forwarder (and a restart doesn't fix it)

So I suppose the next step would be to go to /mnt/data/elastic/:runner-id/services/services-forwarder and check for any odd-looking entries in ./logs (also look at docker logs frc-services-forwarders-services-forwarder). I think there are also some config files in /managed that should contain a list of proxies; if you have more than one host with a proxy role, you could try sshing into each such host and repeating the :9200 check?
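Something like this would cover both checks (a sketch; substitute your cluster id and whatever hosts carry the proxy role):

AC_CLUSTER_ID=70bceb62fa7f480395ca2cb9d6573fa3
# walk the hops on the AC host
curl -s -o /dev/null -w "9244 -> %{http_code}\n" -H "X-Found-Cluster: $AC_CLUSTER_ID" localhost:9244
curl -sk -o /dev/null -w "9243 -> %{http_code}\n" -H "X-Found-Cluster: $AC_CLUSTER_ID" https://localhost:9243
curl -s -o /dev/null -w "9200 -> %{http_code}\n" -H "X-Found-Cluster: $AC_CLUSTER_ID" localhost:9200
# repeat the proxy check on every host with the proxy role
for HOST in <proxy-host-1> <proxy-host-2> <proxy-host-3>; do
  ssh "$HOST" "curl -s -o /dev/null -w '$HOST 9200 -> %{http_code}\n' -H 'X-Found-Cluster: $AC_CLUSTER_ID' localhost:9200"
done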

Alex,

Yes, ->9244 gives the 503 error - service unavailable.
->9243 gives 401 - authentication needed, so it works, with a bit of help - I used:
curl -k -H "X-Found-Cluster: 70bceb62fa7f480395ca2cb9d6573fa3" https://localhost:9243
->9200 gives 401 also.

I found the logs but didn't see anything that stood out - lots of INFO entries.

[2020-03-18 13:42:57,791][INFO ][no.found.servicesforwarder.haproxy.ServicesForwarderHaproxyConfigurationWriter] Writing updated client haproxy configuration file for proxies:[ProxyPublicInstances(Map(10.5.2.152 -> ProxyPublicInstance({"metadata":{},"attributes":{"proxy_host_ip":"10.5.2.152","proxy_public_hostname":"10.5.2.152","zone":"MY_ZONE-2"},"v2proxyhealth":{"routing_table":{"elasticsearch":{"deployments":"6","instances":"14"},"kibana":{"deployments":"4","instances":"2"}}},"health":{"healthy":true,"curator":{"state":"RECONNECTED","since":"2020-03-16T15:59:14.938Z"},"dynamic_settings":{"initialized":true,"settings":{"http":{"cluster":{"max_outstanding_requests":2000},"cookies":{"cookie_secret":"-","user_cookie_key":"user"},"sso":{"sso_secret":"-","max_age":3600000,"cookie_name":"found-sso","maintenance_bypass_cookie_name":"found-maintenance-bypass","default_redirect_path":"/","dont_log_requests":false},"dashboards":{"base_url":"http://d1wko2d5vgoj53.cloudfront.net/_found.no/dashboards"},"healthcheck":{"minimum_proxy_services":1,"disconnected_cutoff":259200000}},"signature_valid_for_millis":60000,"signature_secret":"-","signature_future_signed_tolerance_millis":0}},"appsearch":{"initialized":true,"counts":{"cached":1,"clusters":0,"instances":0,"connectors":0,"lockdown":0}},"elasticsearch":{"initialized":true,"counts":{"cached":81,"clusters":6,"instances":14,"connectors":14,"lockdown":0}},"enterprisesearch":{"initialized":true,"counts":{"cached":0,"clusters":0,"instances":0,"connectors":0,"lockdown":0}},"sitesearch":{"initialized":true,"counts":{"cached":0,"clusters":0,"instances":0,"connectors":0,"lockdown":0}},"kibana":{"initialized":true,"counts":{"cached":34,"clusters":4,"instances":1,"connectors":3,"lockdown":0}},"apm":{"initialized":true,"counts":{"cached":1,"clusters":0,"instances":0,"connectors":0,"lockdown":0}}}},30064771846,30064778965,1584374424693,1584452166562,27,0,0,793365334931079188,1621,0,30064771846
), 10.5.3.153 -> ProxyPublicInstance({"metadata":{},"attributes":{"proxy_host_ip":"10.5.3.153","proxy_public_hostname":"10.5.3.153","zone":"MY_ZONE-3"},"v2proxyhealth":{"routing_table":{"elasticsearch":{"deployments":"6","instances":"14"},"kibana":{"deployments":"4","instances":"1"}}},"health":{"healthy":true,"curator":{"state":"CONNECTED","since":"2020-03-16T16:00:26.179Z"},"dynamic_settings":{"initialized":true,"settings":{"http":{"cluster":{"max_outstanding_requests":2000},"cookies":{"cookie_secret":"-","user_cookie_key":"user"},"sso":{"sso_secret":"-","max_age":3600000,"cookie_name":"found-sso","maintenance_bypass_cookie_name":"found-maintenance-bypass","default_redirect_path":"/","dont_log_requests":false},"dashboards":{"base_url":"http://d1wko2d5vgoj53.cloudfront.net/_found.no/dashboards"},"healthcheck":{"minimum_proxy_services":1,"disconnected_cutoff":259200000}},"signature_valid_for_millis":60000,"signature_secret":"-","signature_future_signed_tolerance_millis":0}},"appsearch":{"initialized":true,"counts":{"cached":1,"clusters":0,"instances":0,"connectors":0,"lockdown":0}},"elasticsearch":{"initialized":true,"counts":{"cached":81,"clusters":6,"instances":14,"connectors":14,"lockdown":0}},"enterprisesearch":{"initialized":true,"counts":{"cached":0,"clusters":0,"instances":0,"connectors":0,"lockdown":0}},"sitesearch":{"initialized":true,"counts":{"cached":0,"clusters":0,"instances":0,"connectors":0,"lockdown":0}},"kibana":{"initialized":true,"counts":{"cached":34,"clusters":4,"instances":1,"connectors":3,"lockdown":0}},"apm":{"initialized":true,"counts":{"cached":1,"clusters":0,"instances":0,"connectors":0,"lockdown":0}}}},30064771852,30064778972,1584374426719,1584452166634,30,0,0,793365334931079194,1619,0,30064771852
), 10.5.6.186 -> ProxyPublicInstance({"metadata":{},"attributes":{"proxy_host_ip":"10.5.6.186","proxy_public_hostname":"10.5.6.186","zone":"MY_ZONE-1"},"v2proxyhealth":{"routing_table":{"elasticsearch":{"deployments":"6","instances":"14"},"kibana":{"deployments":"4","instances":"1"}}},"health":{"healthy":true,"curator":{"state":"RECONNECTED","since":"2020-03-16T16:00:43.673Z"},"dynamic_settings":{"initialized":true,"settings":{"http":{"cluster":{"max_outstanding_requests":2000},"cookies":{"cookie_secret":"-","user_cookie_key":"user"},"sso":{"sso_secret":"-","max_age":3600000,"cookie_name":"found-sso","maintenance_bypass_cookie_name":"found-maintenance-bypass","default_redirect_path":"/","dont_log_requests":false},"dashboards":{"base_url":"http://d1wko2d5vgoj53.cloudfront.net/_found.no/dashboards"},"healthcheck":{"minimum_proxy_services":1,"disconnected_cutoff":259200000}},"signature_valid_for_millis":60000,"signature_secret":"-","signature_future_signed_tolerance_millis":0}},"appsearch":{"initialized":true,"counts":{"cached":1,"clusters":0,"instances":0,"connectors":0,"lockdown":0}},"elasticsearch":{"initialized":true,"counts":{"cached":81,"clusters":6,"instances":14,"connectors":14,"lockdown":0}},"enterprisesearch":{"initialized":true,"counts":{"cached":0,"clusters":0,"instances":0,"connectors":0,"lockdown":0}},"sitesearch":{"initialized":true,"counts":{"cached":0,"clusters":0,"instances":0,"connectors":0,"lockdown":0}},"kibana":{"initialized":true,"counts":{"cached":34,"clusters":4,"instances":1,"connectors":3,"lockdown":0}},"apm":{"initialized":true,"counts":{"cached":1,"clusters":0,"instances":0,"connectors":0,"lockdown":0}}}},30064771849,30064778967,1584374424694,1584452166564,29,0,0,864691133595844620,1621,0,30064771849
)))], adminconsoles:[AdminConsolePublicInstances(Map())] to [/app/managed/haproxy.cfg] {}
  • Is it significant that the last bit about admin consoles includes an empty Map()?

haproxy.log had this:

[WARNING] 076/140711 (15) : Failed to connect to the old process socket '/app/data/haproxy/haproxy.sock'
[ALERT] 076/140711 (15) : Failed to get the sockets from the old process!
[WARNING] 076/140711 (15) : Reexecuting Master process
[WARNING] 076/140719 (15) : Reexecuting Master process
[WARNING] 076/140719 (15) : Former worker 81 exited with code 0
[WARNING] 076/140719 (15) : Former worker 37 exited with code 0
[WARNING] 076/140719 (15) : Exiting Master process...
[ALERT] 076/140719 (15) : Current worker 83 exited with code 143
[WARNING] 076/140719 (15) : All workers exited. Exiting... (143)
[WARNING] 077/134250 (14) : Failed to connect to the old process socket '/app/data/haproxy/haproxy.sock'
[ALERT] 077/134250 (14) : Failed to get the sockets from the old process!
[WARNING] 077/134250 (14) : Reexecuting Master process
[WARNING] 077/134258 (14) : Former worker 33 exited with code 0

I found the config files; all my servers have the proxy role. I tried the :9200 check on each and they gave the same result - 401, needing auth, on each server.
curl -H "X-Found-Cluster: 70bceb62fa7f480395ca2cb9d6573fa3" localhost:9200

@Alex_Piggott & team,

I have been unsuccessful in recovering this ECE platform but it seems some of the clusters are still in existence.

If I were to create a new ECE platform, is there a way to migrate or back up / restore the clusters so as not to lose them?

Oh sorry, I wrote a response to your last post but I just realized it didn't send :frowning:

For reference it was:


Stranger and stranger!

OK, can you post the haproxy.cfg in the services-forwarder/managed dir (I think; one of the subdirs anyway) for each of the AC hosts (together with whether that AC is working or not)?

Also can you check the health routes the proxies advertise: https://proxyhost:9243/_health


To answer your later question ... the normal way of migrating to a working system would be to add a new host in "zone1", migrate the "zone1" cluster instances over, then decommission zone1. Then rinse and repeat for the other zones.
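Zone by zone it's roughly this (and I'm assuming I've remembered the allocator vacate endpoints correctly - please verify them against your ECE version's API docs before relying on it):

# 1] install ECE on a new host with the allocator role (plus any other roles the old host had) in the same zone
# 2] stop new work landing on the old allocator, then move its instances off
curl -u admin:$PASSWORD -X POST "localhost:12400/api/v1/platform/infrastructure/allocators/<old-allocator-id>/maintenance-mode/_start"
curl -u admin:$PASSWORD -X POST "localhost:12400/api/v1/platform/infrastructure/allocators/<old-allocator-id>/clusters/_move"
# 3] once it is empty, remove the old host from the platform, then repeat for the next zone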

Alex,

thanks. I checked each of the AC hosts. According to the GET clusters topology they are all healthy.
10.2.x haproxy.cfg

global
    maxconn 100000
    tune.ssl.default-dh-param 2048

defaults
    option clitcpka
    option srvtcpka
    timeout connect 10s
    timeout client 30s
    timeout server 10m
    option log-health-checks


frontend fe-server-0
    bind *:9244

    mode http
    maxconn 100000
    timeout client 30s
    http-request del-header X-Forwarded-For
    http-request del-header X-Forwarded-Proto
    http-request del-header X-Forwarded-Port
    http-request del-header X-Forwarded-Host
    default_backend be-server-0

backend be-server-0
    mode http
    timeout connect 10s
    timeout server 10m
    balance roundrobin
    option httpchk GET /_health HTTP/1.1\r\nHost:\ 10.5.2.152
    server server-0-0 10.5.2.152:9243 check inter 30s rise 3 fall 2 check-ssl  ssl verify required ca-file /app/managed/proxy-cert.pem
    server server-0-1 10.5.3.153:9243 check inter 30s rise 3 fall 2 check-ssl  ssl verify required ca-file /app/managed/proxy-cert.pem
    server server-0-2 10.5.6.186:9243 check inter 30s rise 3 fall 2 check-ssl  ssl verify required ca-file /app/managed/proxy-cert.pem
frontend fe-server-1
    bind *:12344

    mode http
    maxconn 100000
    timeout client 30s
    http-request del-header X-Forwarded-For
    http-request del-header X-Forwarded-Proto
    http-request del-header X-Forwarded-Port
    http-request del-header X-Forwarded-Host
    default_backend be-server-1

backend be-server-1
    mode http
    timeout connect 10s
    timeout server 10m
    balance roundrobin
    server server-1-0 10.5.6.186:12343 check  ssl verify required ca-file /app/managed/adminconsole-cert.pem
    server server-1-1 10.5.3.153:12343 check  ssl verify required ca-file /app/managed/adminconsole-cert.pem
    server server-1-2 10.5.2.152:12343 check  ssl verify required ca-file /app/managed/adminconsole-cert.pem

10.3.x

global
    maxconn 100000
    tune.ssl.default-dh-param 2048

defaults
    option clitcpka
    option srvtcpka
    timeout connect 10s
    timeout client 30s
    timeout server 10m
    option log-health-checks


frontend fe-server-0
    bind *:9244

    mode http
    maxconn 100000
    timeout client 30s
    http-request del-header X-Forwarded-For
    http-request del-header X-Forwarded-Proto
    http-request del-header X-Forwarded-Port
    http-request del-header X-Forwarded-Host
    default_backend be-server-0

backend be-server-0
    mode http
    timeout connect 10s
    timeout server 10m
    balance roundrobin
    option httpchk GET /_health HTTP/1.1\r\nHost:\ 10.5.2.152
    server server-0-0 10.5.2.152:9243 check inter 30s rise 3 fall 2 check-ssl  ssl verify required ca-file /app/managed/proxy-cert.pem
    server server-0-1 10.5.3.153:9243 check inter 30s rise 3 fall 2 check-ssl  ssl verify required ca-file /app/managed/proxy-cert.pem
    server server-0-2 10.5.6.186:9243 check inter 30s rise 3 fall 2 check-ssl  ssl verify required ca-file /app/managed/proxy-cert.pem

frontend fe-server-1
    bind *:12344

    mode http
    maxconn 100000
    timeout client 30s
    http-request del-header X-Forwarded-For
    http-request del-header X-Forwarded-Proto
    http-request del-header X-Forwarded-Port
    http-request del-header X-Forwarded-Host
    default_backend be-server-1

backend be-server-1
    mode http
    timeout connect 10s
    timeout server 10m
    balance roundrobin
    server server-1-0 10.5.6.186:12343 check  ssl verify required ca-file /app/managed/adminconsole-cert.pem
    server server-1-1 10.5.3.153:12343 check  ssl verify required ca-file /app/managed/adminconsole-cert.pem
    server server-1-2 10.5.2.152:12343 check  ssl verify required ca-file /app/managed/adminconsole-cert.pem

10.6.x

global
    maxconn 100000
    tune.ssl.default-dh-param 2048

defaults
    option clitcpka
    option srvtcpka
    timeout connect 10s
    timeout client 30s
    timeout server 10m
    option log-health-checks


frontend fe-server-0
    bind *:9244

    mode http
    maxconn 100000
    timeout client 30s
    http-request del-header X-Forwarded-For
    http-request del-header X-Forwarded-Proto
    http-request del-header X-Forwarded-Port
    http-request del-header X-Forwarded-Host
    default_backend be-server-0

backend be-server-0
    mode http
    timeout connect 10s
    timeout server 10m
    balance roundrobin
    option httpchk GET /_health HTTP/1.1\r\nHost:\ 10.5.2.152
    server server-0-0 10.5.2.152:9243 check inter 30s rise 3 fall 2 check-ssl  ssl verify required ca-file /app/managed/proxy-cert.pem
    server server-0-1 10.5.3.153:9243 check inter 30s rise 3 fall 2 check-ssl  ssl verify required ca-file /app/managed/proxy-cert.pem
    server server-0-2 10.5.6.186:9243 check inter 30s rise 3 fall 2 check-ssl  ssl verify required ca-file /app/managed/proxy-cert.pem

frontend fe-server-1
    bind *:12344

    mode http
    maxconn 100000
    timeout client 30s
    http-request del-header X-Forwarded-For
    http-request del-header X-Forwarded-Proto
    http-request del-header X-Forwarded-Port
    http-request del-header X-Forwarded-Host
    default_backend be-server-1

backend be-server-1
    mode http
    timeout connect 10s
    timeout server 10m
    balance roundrobin
    server server-1-0 10.5.6.186:12343 check  ssl verify required ca-file /app/managed/adminconsole-cert.pem
    server server-1-1 10.5.3.153:12343 check  ssl verify required ca-file /app/managed/adminconsole-cert.pem
    server server-1-2 10.5.2.152:12343 check  ssl verify required ca-file /app/managed/adminconsole-cert.pem

I wonder if this is relevant... I checked the healthcheck endpoint, and with curl -k it was fine (ok: true, status: 200). HOWEVER, when I tried to verify with --cacert and the pem files given above (adminconsole-cert.pem or proxy-cert.pem), curl reported:

curl: (60) Peer's Certificate has expired.

Could this be the mechanism by which things are UP but there are simultaneously no servers available?
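One way to see which certificate is actually being presented on the proxy port, and when it expires (a sketch; openssl should already be on the hosts):

echo | openssl s_client -connect localhost:9243 2>/dev/null | openssl x509 -noout -subject -issuer -enddate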

Is it possible that the (external) proxy cert that you uploaded has expired?

A colleague just reminded me that although the AC -> services forwarder -> proxy -> cluster path is internal, the ->proxy step uses the external cert.

Alex,

thanks for the hints. I have made some progress. I managed to renew the wildcard cert we use for our domain. I used the API to replace the certificate. I restarted services-forwarder and proxyv2.

The healthcheck is still using an old certificate - but an explicit one, not the wildcard one, if that makes sense: it has ece. instead of *. before the domain.
I had a hunt in the services configs for .pem files and found plenty, including one - only one - which had the expired date, 29th Jan. That was the route-server:
./route-server/managed/cert.pem
I docker restart-ed the route-server for good measure but no luck.
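For reference, the hunt for expired .pem files can be done roughly like this (a sketch; the path assumes the /mnt/data/elastic layout Alex mentioned earlier):

find /mnt/data/elastic/*/services -name '*.pem' 2>/dev/null | while read -r f; do
  echo "$f: $(openssl x509 -noout -enddate -in "$f" 2>/dev/null)"
done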

I'm stumped on how to get the new certificate - or another new certificate - to the route server. I presume the wildcard one I renewed was needed for other reasons but this one is different?