Unable to discover master and lots of certificate errors

I have a cluster in which no master has been discoverable for about four days, and I don't know how it got into this state or how to recover. I am still pretty green on this stack; I have tried a few things, such as deleting the dedicated master pods one at a time, and I am concerned I might have pushed it into an unrecoverable state.

I was using ECK 1.2.1 and have since optimistically upgraded to 1.3.0. The ECK operator logs suggest that it cannot update settings or delete voting exclusions.

Elastic Operator Error

2020-11-23T17:06:05.105390197Z {"log.level":"info","@timestamp":"2020-11-23T17:06:05.105Z","log.logger":"elasticsearch-controller","message":"Starting reconciliation run","service.version":"1.3.0+6db1914b","service.type":"eck","ecs.version":"1.4.0","iteration":74,"namespace":"prod-us-a-elk","es_name":"prod"}
2020-11-23T17:06:35.329307222Z {"log.level":"info","@timestamp":"2020-11-23T17:06:35.329Z","log.logger":"driver","message":"Could not update cluster license","service.version":"1.3.0+6db1914b","service.type":"eck","ecs.version":"1.4.0","err":"while getting current license level 503 Service Unavailable: ","namespace":"prod-us-a-elk","es_name":"prod"}
2020-11-23T17:07:05.331496621Z {"log.level":"error","@timestamp":"2020-11-23T17:07:05.331Z","log.logger":"driver","message":"Could not update remote clusters in Elasticsearch settings","service.version":"1.3.0+6db1914b","service.type":"eck","ecs.version":"1.4.0","namespace":"prod-us-a-elk","es_name":"prod","error":"503 Service Unavailable: ","error.stack_trace":"github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver.(*defaultDriver).Reconcile\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver/driver.go:214\ngithub.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch.(*ReconcileElasticsearch).internalReconcile\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/elasticsearch_controller.go:288\ngithub.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch.(*ReconcileElasticsearch).Reconcile\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/elasticsearch_controller.go:195\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.3/pkg/internal/controller/controller.go:244\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.3/pkg/internal/controller/controller.go:218\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.3/pkg/internal/controller/controller.go:197\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:90"}
2020-11-23T17:07:05.331534833Z {"log.level":"info","@timestamp":"2020-11-23T17:07:05.331Z","log.logger":"keystore","message":"Secure settings secret not found","service.version":"1.3.0+6db1914b","service.type":"eck","ecs.version":"1.4.0","namespace":"prod-us-a-elk","secret_name":"elastic-iam-creds"}
2020-11-23T17:07:05.351359001Z {"log.level":"info","@timestamp":"2020-11-23T17:07:05.351Z","log.logger":"driver","message":"Limiting master nodes creation to one at a time","service.version":"1.3.0+6db1914b","service.type":"eck","ecs.version":"1.4.0","namespace":"prod-us-a-elk","statefulset_name":"prod-es-master-az3-002","target":1,"actual":0}
2020-11-23T17:07:05.356672795Z {"log.level":"info","@timestamp":"2020-11-23T17:07:05.356Z","log.logger":"zen2","message":"Ensuring no voting exclusions are set","service.version":"1.3.0+6db1914b","service.type":"eck","ecs.version":"1.4.0","namespace":"prod-us-a-elk","es_name":"prod"}
2020-11-23T17:07:11.635131596Z {"log.level":"info","@timestamp":"2020-11-23T17:07:11.634Z","log.logger":"generic-reconciler","message":"Updating resource","service.version":"1.3.0+6db1914b","service.type":"eck","ecs.version":"1.4.0","kind":"ConfigMap","namespace":"elastic-system","name":"elastic-licensing"}
2020-11-23T17:07:35.358727885Z {"log.level":"info","@timestamp":"2020-11-23T17:07:35.358Z","log.logger":"elasticsearch-controller","message":"Ending reconciliation run","service.version":"1.3.0+6db1914b","service.type":"eck","ecs.version":"1.4.0","iteration":74,"namespace":"prod-us-a-elk","es_name":"prod","took":90.253348031}
2020-11-23T17:07:35.358974810Z {"log.level":"error","@timestamp":"2020-11-23T17:07:35.358Z","log.logger":"controller","message":"Reconciler error","service.version":"1.3.0+6db1914b","service.type":"eck","ecs.version":"1.4.0","controller":"elasticsearch-controller","name":"prod","namespace":"prod-us-a-elk","error":"unable to delete /_cluster/voting_config_exclusions: 503 Service Unavailable: ","errorCauses":[{"error":"unable to delete /_cluster/voting_config_exclusions: 503 Service Unavailable: unknown","errorVerbose":"503 Service Unavailable: unknown\nunable to delete /_cluster/voting_config_exclusions\ngithub.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/client.(*clientV7).DeleteVotingConfigExclusions\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/client/v7.go:60\ngithub.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/version/zen2.ClearVotingConfigExclusions\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/version/zen2/voting_exclusions.go:72\ngithub.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver.(*defaultDriver).reconcileNodeSpecs\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver/nodes.go:147\ngithub.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver.(*defaultDriver).Reconcile\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver/driver.go:249\ngithub.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch.(*ReconcileElasticsearch).internalReconcile\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/elasticsearch_controller.go:288\ngithub.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch.(*ReconcileElasticsearch).Reconcile\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/elasticsearch_controller.go:195\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.3/pkg/internal/controller/controller.go:244\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.3/pkg/internal/controller/controller.go:218\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.3/pkg/internal/controller/controller.go:197\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:90\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1374"}],"error.stack_trace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.3/pkg/internal/controller/controller.go:246\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.3/pkg/internal/controller/controller.go:218\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.3/pkg/internal/controller/controller.go:197\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:90"}
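
For what it's worth, this is roughly how I have been checking the cluster from the Kubernetes side (a sketch; the secret, pod, and resource names follow the standard ECK conventions for a cluster named "prod", and curl is run from inside one of the master pods). In the current state the health call just comes back with a 503 as well:

kubectl get elasticsearch prod -n prod-us-a-elk

# Ask Elasticsearch directly; -k because the HTTP layer uses the
# operator-issued self-signed certificates
PW=$(kubectl get secret prod-es-elastic-user -n prod-us-a-elk \
  -o go-template='{{.data.elastic | base64decode}}')
kubectl exec -n prod-us-a-elk prod-es-master-az1-002-0 -- \
  curl -sk -u "elastic:$PW" "https://localhost:9200/_cluster/health?pretty"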

Both of the dedicated masters' logs also show errors that the cluster could not bootstrap itself, and then I see a lot of log entries about certificates being rejected.

2020-11-23T17:27:18.376844210Z {"type": "server", "timestamp": "2020-11-23T17:27:18,376Z", "level": "WARN", "component": "o.e.c.c.ClusterFormationFailureHelper", "cluster.name": "prod", "node.name": "prod-es-master-az1-002-0", "message": "master not discovered yet, this node has not previously joined a bootstrapped (v7+) cluster, and [cluster.initial_master_nodes] is empty on this node: have discovered [{prod-es-master-az1-002-0}{fZGZbr4bTy2sE_QZBJSGsQ}{jouSOeHjSv2X7FRJGbzXiA}{10.110.128.97}{10.110.128.97:9300}{mr}{k8s_node_name=ip-10-110-129-111.ec2.internal, xpack.installed=true, zone=us-east-1a, transform.node=false}, {prod-es-master-az2-002-0}{zMDiIVc0R1KUeu7xJkdTQQ}{10B8NLYiQWqm-UjhK7d0_w}{10.110.139.103}{10.110.139.103:9300}{mr}{zone=us-east-1b, xpack.installed=true, transform.node=false}]; discovery will continue using [127.0.0.1:9300, 127.0.0.1:9301, 127.0.0.1:9302, 127.0.0.1:9303, 127.0.0.1:9304, 127.0.0.1:9305, 10.110.139.103:9300] from hosts providers and [{prod-es-master-az1-002-0}{fZGZbr4bTy2sE_QZBJSGsQ}{jouSOeHjSv2X7FRJGbzXiA}{10.110.128.97}{10.110.128.97:9300}{mr}{k8s_node_name=ip-10-110-129-111.ec2.internal, xpack.installed=true, zone=us-east-1a, transform.node=false}] from last-known cluster state; node term 0, last-accepted version 0 in term 0" }
2020-11-23T17:28:09.430475969Z {"type": "server", "timestamp": "2020-11-23T17:28:09,430Z", "level": "WARN", "component": "o.e.x.s.t.n.SecurityNetty4HttpServerTransport", "cluster.name": "prod", "node.name": "prod-es-master-az1-002-0", "message": "http client did not trust this server's certificate, closing connection Netty4HttpChannel{localAddress=/10.110.128.97:9200, remoteAddress=/10.110.136.97:52208}" }
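
In case it helps with diagnosis, the "node term 0, last-accepted version 0" part means these masters have no on-disk cluster state at all; one way to cross-check is to list the data volume claims using the standard ECK cluster-name label (a sketch):

# Claims that were recently re-created for the master StatefulSets would
# explain the empty cluster state in the log above
kubectl get pvc -n prod-us-a-elk -l elasticsearch.k8s.elastic.co/cluster-name=prod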

I have also tried manually editing elasticsearch.yml to include cluster.initial_master_nodes, but unsurprisingly ECK overrides that config.
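
For reference, the rendered config shows what ECK itself is putting there; a sketch of how to inspect it from outside the pod (assuming the standard config path in the Elasticsearch image):

# Hand edits to this file are reverted on the next reconciliation, which is
# why setting cluster.initial_master_nodes this way does not stick
kubectl exec -n prod-us-a-elk prod-es-master-az1-002-0 -- \
  grep -E 'initial_master_nodes|seed_hosts|seed_providers' \
  /usr/share/elasticsearch/config/elasticsearch.yml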

It turns out the certificate errors were actually a separate issue.

I had not realized we had duplicated the public cert configmap into another namespace where Logstash ran. That copy was unmanaged by the elastic-operator, so when the certificate rotation happened it still held the old certificate.

I still don't understand how my cluster ended up masterless or how to recover from this state. Most things I read say to set cluster.initial_master_nodes, but that setting is managed by the operator and doesn't seem to be configurable. Is there likely to be a way to bring this cluster back?

I am assuming you mean the Kubernetes secret which contains the certs for Elasticsearch here?
If you copy those secrets without removing the owner references, you will run into a Kubernetes bug that leads to data loss: resources belonging to your Elasticsearch cluster are deleted by the garbage collector. This is consistent with your observation that the master nodes claim they have never joined a cluster before, which can mean that they have lost the persistent volumes where they store their state.
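
You can verify this on the copied secret: if it still carries an ownerReference pointing back at the Elasticsearch object, it was copied with the operator's metadata intact. A sketch, assuming you copied the prod-es-http-certs-public secret under its original name; the namespace placeholder is whichever namespace you copied it into:

# A non-empty result referencing kind: Elasticsearch means the copy still has
# the owner reference that triggers the garbage collection bug
kubectl get secret prod-es-http-certs-public -n <namespace-of-the-copy> \
  -o jsonpath='{.metadata.ownerReferences}'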

We have documented this problem here: Common problems | Elastic Cloud on Kubernetes [2.10] | Elastic

Please be careful when copying objects created by the operator into other namespaces and remove the owner references first.
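
Something along these lines is a safer way to make such a copy (a sketch; adjust the secret name and the target namespace to your setup, and it relies on jq being available):

# Copy the public HTTP certs secret while stripping the metadata that ties it
# to the Elasticsearch resource (owner references, UID, resourceVersion)
kubectl get secret prod-es-http-certs-public -n prod-us-a-elk -o json \
  | jq 'del(.metadata.ownerReferences, .metadata.uid,
        .metadata.resourceVersion, .metadata.creationTimestamp,
        .metadata.namespace)' \
  | kubectl apply -n <target-namespace> -f -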


@pebrc Yes, it was a Kubernetes secret, and that does look like the bug we ran into.

Wow, that is a pretty brutal manifestation of that bug. Glad to have learned this lesson on an internal application that is still in alpha. Thank you for the insight.