elastic-operator-0 goes into CrashLoopBackOff, possibly after the underlying node restarted

I am running elastic-operator 1.0.0-beta1 and hitting the following issue.
image: docker.elastic.co/eck/eck-operator:1.0.0-beta1


❯ kubectl get pods -n elasticsearch                                                                                                                                                   
NAME                                      READY   STATUS             RESTARTS   AGE
ct-es-es-data-nodes-0                     1/1     Running            0          19d
ct-es-es-data-nodes-1                     1/1     Running            0          19d
ct-es-es-data-nodes-2                     1/1     Running            0          19d
ct-kibana-kb-c89445c75-cvvjf              1/1     Running            1          19d
elastic-operator-0                        0/1     CrashLoopBackOff   5565       19d

I think there was possibly an EKS upgrade performed 19 days ago which restarted all the AWS nodes in the cluster, and since then the operator pod has been going into this state, throwing these logs.

❯ kubectl logs elastic-operator-0 -n elasticsearch
{"level":"info","@timestamp":"2020-04-15T17:55:27.084Z","logger":"manager","message":"Setting up client for manager","ver":"1.0.0-beta1-84792e30"}
{"level":"info","@timestamp":"2020-04-15T17:55:27.084Z","logger":"manager","message":"Setting up scheme","ver":"1.0.0-beta1-84792e30"}
{"level":"info","@timestamp":"2020-04-15T17:55:27.085Z","logger":"manager","message":"Setting up manager","ver":"1.0.0-beta1-84792e30"}
{"level":"info","@timestamp":"2020-04-15T17:55:27.590Z","logger":"controller-runtime.metrics","message":"metrics server is starting to listen","ver":"1.0.0-beta1-84792e30","addr":":0"}
{"level":"error","@timestamp":"2020-04-15T17:55:27.592Z","logger":"manager","message":"unable to get operator info","ver":"1.0.0-beta1-84792e30","error":"configmaps \"elastic-operator-uuid\" is forbidden: User \"system:serviceaccount:elasticsearch:elastic-operator\" cannot get resource \"configmaps\" in API group \"\" in the namespace \"elasticsearch\"","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128\ngithub.com/elastic/cloud-on-k8s/cmd/manager.execute\n\t/go/src/github.com/elastic/cloud-on-k8s/cmd/manager/main.go:254\ngithub.com/elastic/cloud-on-k8s/cmd/manager.glob..func1\n\t/go/src/github.com/elastic/cloud-on-k8s/cmd/manager/main.go:74\ngithub.com/spf13/cobra.(*Command).execute\n\t/go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:830\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:914\ngithub.com/spf13/cobra.(*Command).Execute\n\t/go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:864\nmain.main\n\t/go/src/github.com/elastic/cloud-on-k8s/cmd/main.go:27\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:203"}

Can someone help me understand what the reason is and how to recover the operator pod without impacting the existing ES cluster?

You can delete the operator pod without affecting the Elasticsearch resource (though any updates would not be processed while there is effectively no operator). That said, the error looks like the service account elastic-operator does not have permission to read configmaps in the elasticsearch namespace.
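You can verify that from outside the pod by impersonating the service account (a quick sketch, using the namespace and service account names from your output; adjust if yours differ):

❯ kubectl auth can-i get configmaps -n elasticsearch \
    --as=system:serviceaccount:elasticsearch:elastic-operator

If that prints "no", the role binding that grants the operator access to configmaps in that namespace is missing or broken.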

Also, some other people have noted odd permission problems when nodes were restarted in some environments (though not this one specifically) that were resolved when they deleted the pod.

Thanks @Anya_Sabo, deleting the pod did not help. As this is a StatefulSet, Kubernetes brings the pod back up, resulting in the same error during container startup.
This cluster and namespace were in an untouched state until the nodes were restarted.

Can anyone confirm whether these problems go away if we move to the latest stable operator version, 1.0.1?

Also, if I want to respin the cluster, is it possible to attach the old PVs to the new cluster's pods, assuming I retained the old PVs during the teardown of the old cluster? Any documentation pointer on how to achieve this would help.

deleting the pod did not help

I suggest you reapply ECK's YAML manifests; it looks like something is broken in the RBAC setup. Reapplying the manifests should normally be a no-op.
You can also delete the ECK StatefulSet and reapply the manifests.
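For example (a rough sketch; the URL below is the standard ECK 1.0.0-beta1 all-in-one manifest, so substitute whatever manifests you originally installed from, and note your operator runs in the elasticsearch namespace rather than the default elastic-system):

❯ kubectl apply -f https://download.elastic.co/downloads/eck/1.0.0-beta1/all-in-one.yaml

# If the StatefulSet itself looks broken, delete it first, then reapply:
❯ kubectl delete statefulset elastic-operator -n elasticsearch
❯ kubectl apply -f https://download.elastic.co/downloads/eck/1.0.0-beta1/all-in-one.yaml

Deleting the operator StatefulSet should not touch your Elasticsearch or Kibana resources; it only means no changes are reconciled until the operator is back up.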

Also, if I want to respin the cluster, is it possible to attach the old PVs to the new cluster's pods, assuming I retained the old PVs during the teardown of the old cluster? Any documentation pointer on how to achieve this would help.

We don't officially support this since it can easily be messed up. However, we did build a tool for that particular use case: https://github.com/elastic/cloud-on-k8s/tree/master/hack/reattach-pv.

If you removed an existing Elasticsearch resource, but the PVs are still around, you can recreate a new cluster and reattach the existing PVs by running the tool. Note that the new cluster must have the exact same spec (same nodeSets, same count, same config, same name, etc.).
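One prerequisite you can check up front (a sketch, not from the tool's docs): the old PVs must actually survive the teardown, so make sure their reclaim policy is Retain before deleting the old Elasticsearch resource:

# List the PVs backing the data node claims and check the RECLAIM POLICY column.
❯ kubectl get pv | grep ct-es-es-data-nodes

# If a PV is set to Delete, switch it to Retain so the volume and its data survive
# the teardown (<pv-name> is a placeholder):
❯ kubectl patch pv <pv-name> -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'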

In general, we advise relying on Elasticsearch snapshots instead.
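A minimal snapshot setup looks roughly like this (a hedged sketch; the repository name, bucket, and endpoint are placeholders, and S3 is just one of the supported repository types):

# Register a snapshot repository, then take a snapshot through the Elasticsearch API.
❯ curl -k -u "elastic:$PASSWORD" -X PUT "https://<es-endpoint>:9200/_snapshot/my_backup" \
    -H 'Content-Type: application/json' \
    -d '{"type": "s3", "settings": {"bucket": "my-es-snapshots"}}'
❯ curl -k -u "elastic:$PASSWORD" -X PUT "https://<es-endpoint>:9200/_snapshot/my_backup/snapshot_1?wait_for_completion=true"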


I have the same problem. In my Kubernetes cluster I already have one Elasticsearch cluster and it is working great.

Today I needed to create another Elasticsearch cluster, but I'm getting the same error as above, even though I use the same manifests as for the first cluster and only changed the names. When I check the role permissions, everything looks good, but the elastic operator cannot create the configmap elastic-operator-uuid.

But Elasticsearch and Kibana have phase ready and health green.

Daniel, the solution posted by @sebgl above worked perfectly for me. We need to understand the consequences of losing the ECK StatefulSet, though.