I have a cluster where no master has been discoverable for about 4 days, and I don't know how it got into this state or how to recover. I am still pretty green on this stack and have tried a few things, such as deleting the dedicated master pods one at a time, and I am concerned I might have pushed it into an unrecoverable state.
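For context, this is roughly how I have been checking on it (assuming the standard names ECK generates, i.e. the prod-es-http service and the prod-es-elastic-user secret in my namespace):

kubectl -n prod-us-a-elk get elasticsearch prod
kubectl -n prod-us-a-elk get pods -l elasticsearch.k8s.elastic.co/cluster-name=prod
PASSWORD=$(kubectl -n prod-us-a-elk get secret prod-es-elastic-user -o go-template='{{.data.elastic | base64decode}}')
kubectl -n prod-us-a-elk port-forward service/prod-es-http 9200 &
curl -sk -u "elastic:$PASSWORD" "https://localhost:9200/_cluster/health?pretty"

The health call only ever comes back with a 503, consistent with the operator errors below.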
I was using ECK 1.2.1 and have since optimistically upgraded to 1.3.0. The ECK operator logs suggest that it cannot update settings or delete voting exclusions.
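(The operator logs below were pulled with the following, assuming the default install in the elastic-system namespace:)

kubectl -n elastic-system logs statefulset/elastic-operator --since=1h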
Elastic Operator Error
2020-11-23T17:06:05.105390197Z {"log.level":"info","@timestamp":"2020-11-23T17:06:05.105Z","log.logger":"elasticsearch-controller","message":"Starting reconciliation run","service.version":"1.3.0+6db1914b","service.type":"eck","ecs.version":"1.4.0","iteration":74,"namespace":"prod-us-a-elk","es_name":"prod"}
2020-11-23T17:06:35.329307222Z {"log.level":"info","@timestamp":"2020-11-23T17:06:35.329Z","log.logger":"driver","message":"Could not update cluster license","service.version":"1.3.0+6db1914b","service.type":"eck","ecs.version":"1.4.0","err":"while getting current license level 503 Service Unavailable: ","namespace":"prod-us-a-elk","es_name":"prod"}
2020-11-23T17:07:05.331496621Z {"log.level":"error","@timestamp":"2020-11-23T17:07:05.331Z","log.logger":"driver","message":"Could not update remote clusters in Elasticsearch settings","service.version":"1.3.0+6db1914b","service.type":"eck","ecs.version":"1.4.0","namespace":"prod-us-a-elk","es_name":"prod","error":"503 Service Unavailable: ","error.stack_trace":"github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver.(*defaultDriver).Reconcile\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver/driver.go:214\ngithub.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch.(*ReconcileElasticsearch).internalReconcile\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/elasticsearch_controller.go:288\ngithub.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch.(*ReconcileElasticsearch).Reconcile\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/elasticsearch_controller.go:195\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.3/pkg/internal/controller/controller.go:244\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.3/pkg/internal/controller/controller.go:218\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.3/pkg/internal/controller/controller.go:197\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:90"}
2020-11-23T17:07:05.331534833Z {"log.level":"info","@timestamp":"2020-11-23T17:07:05.331Z","log.logger":"keystore","message":"Secure settings secret not found","service.version":"1.3.0+6db1914b","service.type":"eck","ecs.version":"1.4.0","namespace":"prod-us-a-elk","secret_name":"elastic-iam-creds"}
2020-11-23T17:07:05.351359001Z {"log.level":"info","@timestamp":"2020-11-23T17:07:05.351Z","log.logger":"driver","message":"Limiting master nodes creation to one at a time","service.version":"1.3.0+6db1914b","service.type":"eck","ecs.version":"1.4.0","namespace":"prod-us-a-elk","statefulset_name":"prod-es-master-az3-002","target":1,"actual":0}
2020-11-23T17:07:05.356672795Z {"log.level":"info","@timestamp":"2020-11-23T17:07:05.356Z","log.logger":"zen2","message":"Ensuring no voting exclusions are set","service.version":"1.3.0+6db1914b","service.type":"eck","ecs.version":"1.4.0","namespace":"prod-us-a-elk","es_name":"prod"}
2020-11-23T17:07:11.635131596Z {"log.level":"info","@timestamp":"2020-11-23T17:07:11.634Z","log.logger":"generic-reconciler","message":"Updating resource","service.version":"1.3.0+6db1914b","service.type":"eck","ecs.version":"1.4.0","kind":"ConfigMap","namespace":"elastic-system","name":"elastic-licensing"}
2020-11-23T17:07:35.358727885Z {"log.level":"info","@timestamp":"2020-11-23T17:07:35.358Z","log.logger":"elasticsearch-controller","message":"Ending reconciliation run","service.version":"1.3.0+6db1914b","service.type":"eck","ecs.version":"1.4.0","iteration":74,"namespace":"prod-us-a-elk","es_name":"prod","took":90.253348031}
2020-11-23T17:07:35.358974810Z {"log.level":"error","@timestamp":"2020-11-23T17:07:35.358Z","log.logger":"controller","message":"Reconciler error","service.version":"1.3.0+6db1914b","service.type":"eck","ecs.version":"1.4.0","controller":"elasticsearch-controller","name":"prod","namespace":"prod-us-a-elk","error":"unable to delete /_cluster/voting_config_exclusions: 503 Service Unavailable: ","errorCauses":[{"error":"unable to delete /_cluster/voting_config_exclusions: 503 Service Unavailable: unknown","errorVerbose":"503 Service Unavailable: unknown\nunable to delete /_cluster/voting_config_exclusions\ngithub.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/client.(*clientV7).DeleteVotingConfigExclusions\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/client/v7.go:60\ngithub.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/version/zen2.ClearVotingConfigExclusions\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/version/zen2/voting_exclusions.go:72\ngithub.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver.(*defaultDriver).reconcileNodeSpecs\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver/nodes.go:147\ngithub.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver.(*defaultDriver).Reconcile\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver/driver.go:249\ngithub.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch.(*ReconcileElasticsearch).internalReconcile\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/elasticsearch_controller.go:288\ngithub.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch.(*ReconcileElasticsearch).Reconcile\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/elasticsearch_controller.go:195\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.3/pkg/internal/controller/controller.go:244\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.3/pkg/internal/controller/controller.go:218\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.3/pkg/internal/controller/controller.go:197\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:90\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1374"}],"error.stack_trace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.3/pkg/internal/controller/controller.go:246\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.3/pkg/internal/controller/controller.go:218\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.3/pkg/internal/controller/controller.go:197\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:90"}
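As far as I can tell, the call the operator keeps failing on is the voting exclusions API, which can also be issued by hand through the same port-forward as above; I would expect the same 503 for as long as no master is elected:

curl -sk -u "elastic:$PASSWORD" -X DELETE "https://localhost:9200/_cluster/voting_config_exclusions?wait_for_removal=false"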
Both of the dedicated master pods' logs also show errors that the cluster could not bootstrap itself, followed by a lot of messages about certificates being rejected.
2020-11-23T17:27:18.376844210Z {"type": "server", "timestamp": "2020-11-23T17:27:18,376Z", "level": "WARN", "component": "o.e.c.c.ClusterFormationFailureHelper", "cluster.name": "prod", "node.name": "prod-es-master-az1-002-0", "message": "master not discovered yet, this node has not previously joined a bootstrapped (v7+) cluster, and [cluster.initial_master_nodes] is empty on this node: have discovered [{prod-es-master-az1-002-0}{fZGZbr4bTy2sE_QZBJSGsQ}{jouSOeHjSv2X7FRJGbzXiA}{10.110.128.97}{10.110.128.97:9300}{mr}{k8s_node_name=ip-10-110-129-111.ec2.internal, xpack.installed=true, zone=us-east-1a, transform.node=false}, {prod-es-master-az2-002-0}{zMDiIVc0R1KUeu7xJkdTQQ}{10B8NLYiQWqm-UjhK7d0_w}{10.110.139.103}{10.110.139.103:9300}{mr}{zone=us-east-1b, xpack.installed=true, transform.node=false}]; discovery will continue using [127.0.0.1:9300, 127.0.0.1:9301, 127.0.0.1:9302, 127.0.0.1:9303, 127.0.0.1:9304, 127.0.0.1:9305, 10.110.139.103:9300] from hosts providers and [{prod-es-master-az1-002-0}{fZGZbr4bTy2sE_QZBJSGsQ}{jouSOeHjSv2X7FRJGbzXiA}{10.110.128.97}{10.110.128.97:9300}{mr}{k8s_node_name=ip-10-110-129-111.ec2.internal, xpack.installed=true, zone=us-east-1a, transform.node=false}] from last-known cluster state; node term 0, last-accepted version 0 in term 0" }
2020-11-23T17:28:09.430475969Z {"type": "server", "timestamp": "2020-11-23T17:28:09,430Z", "level": "WARN", "component": "o.e.x.s.t.n.SecurityNetty4HttpServerTransport", "cluster.name": "prod", "node.name": "prod-es-master-az1-002-0", "message": "http client did not trust this server's certificate, closing connection Netty4HttpChannel{localAddress=/10.110.128.97:9200, remoteAddress=/10.110.136.97:52208}" }
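My reading of "node term 0, last-accepted version 0 in term 0" is that this node is starting from an empty on-disk cluster state, i.e. it has never joined a bootstrapped cluster, which makes me wonder whether the masters' storage survived my pod deletions. The PVCs can at least be listed with the standard ECK cluster-name label to rule out lost storage:

kubectl -n prod-us-a-elk get pvc -l elasticsearch.k8s.elastic.co/cluster-name=prod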
I have also tried manually editing elasticsearch.yml to include cluster.initial_master_nodes, but unsurprisingly ECK overrides that config.
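For completeness, the CR-level equivalent I was considering is below, just as a sketch; I believe ECK reserves cluster.initial_master_nodes for its own bootstrap management, so I would expect it to be stripped from there as well. The node names follow the <cluster>-es-<nodeSet>-<ordinal> convention visible in the logs:

kubectl -n prod-us-a-elk edit elasticsearch prod

# under each master nodeSet in spec.nodeSets:
config:
  cluster.initial_master_nodes:
  - prod-es-master-az1-002-0
  - prod-es-master-az2-002-0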