elastic-operator-0 goes into CrashLoopBackOff, possibly after the underlying node restarted

I am running elastic-operator 1.0.0-beta1 and hitting the following issue.
image: docker.elastic.co/eck/eck-operator:1.0.0-beta1


❯ kubectl get pods -n elasticsearch                                                                                                                                                   
NAME                                      READY   STATUS             RESTARTS   AGE
ct-es-es-data-nodes-0                     1/1     Running            0          19d
ct-es-es-data-nodes-1                     1/1     Running            0          19d
ct-es-es-data-nodes-2                     1/1     Running            0          19d
ct-kibana-kb-c89445c75-cvvjf              1/1     Running            1          19d
elastic-operator-0                        0/1     CrashLoopBackOff   5565       19d

I think there was possibly an EKS upgrade performed 19 days ago which restarted all the AWS nodes in the cluster, and since then the operator pod has been going into this state, throwing these logs.

❯ kubectl logs elastic-operator-0 -n elasticsearch
{"level":"info","@timestamp":"2020-04-15T17:55:27.084Z","logger":"manager","message":"Setting up client for manager","ver":"1.0.0-beta1-84792e30"}
{"level":"info","@timestamp":"2020-04-15T17:55:27.084Z","logger":"manager","message":"Setting up scheme","ver":"1.0.0-beta1-84792e30"}
{"level":"info","@timestamp":"2020-04-15T17:55:27.085Z","logger":"manager","message":"Setting up manager","ver":"1.0.0-beta1-84792e30"}
{"level":"info","@timestamp":"2020-04-15T17:55:27.590Z","logger":"controller-runtime.metrics","message":"metrics server is starting to listen","ver":"1.0.0-beta1-84792e30","addr":":0"}
{"level":"error","@timestamp":"2020-04-15T17:55:27.592Z","logger":"manager","message":"unable to get operator info","ver":"1.0.0-beta1-84792e30","error":"configmaps \"elastic-operator-uuid\" is forbidden: User \"system:serviceaccount:elasticsearch:elastic-operator\" cannot get resource \"configmaps\" in API group \"\" in the namespace \"elasticsearch\"","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128\ngithub.com/elastic/cloud-on-k8s/cmd/manager.execute\n\t/go/src/github.com/elastic/cloud-on-k8s/cmd/manager/main.go:254\ngithub.com/elastic/cloud-on-k8s/cmd/manager.glob..func1\n\t/go/src/github.com/elastic/cloud-on-k8s/cmd/manager/main.go:74\ngithub.com/spf13/cobra.(*Command).execute\n\t/go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:830\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:914\ngithub.com/spf13/cobra.(*Command).Execute\n\t/go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:864\nmain.main\n\t/go/src/github.com/elastic/cloud-on-k8s/cmd/main.go:27\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:203"}

Can someone help me understand what the reason is and how to recover the operator pod without impacting the existing ES cluster?

You can delete the operator pod without affecting the Elasticsearch resource (though any updates would not be processed while there is effectively no operator). That said, the error looks like the service account elastic-operator does not have permission to read configmaps in the elasticsearch namespace.
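You can verify that from outside the pod by impersonating the service account (a quick sketch, using the namespace and service account names from your output; adjust if yours differ):

❯ kubectl auth can-i get configmaps -n elasticsearch \
    --as=system:serviceaccount:elasticsearch:elastic-operator

If that prints "no", the role binding that grants the operator access to configmaps in that namespace is missing or broken.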

Also, some other people have noted odd permission problems when nodes were restarted in some environments (though not this one specifically) that were resolved when they deleted the pod.

Thanks @Anya_Sabo, deleting the pod did not help. As this is a StatefulSet, Kubernetes brings the pod back up, resulting in the same error during container startup.
This cluster and namespace were in an untouched state until the nodes were restarted.

Can anyone confirm whether these problems go away if we move to the latest stable operator version, 1.0.1?

Also, if I want to respin the cluster, is it possible to attach the old PVs to the new cluster's pods, assuming I retained the old PVs during the teardown of the old cluster? Any documentation pointer on how to achieve this would help.

deleting the pod did not help

I suggest you reapply ECK's YAML manifests; it looks like something is broken in the RBAC setup. Reapplying the manifests should normally be a no-op.
You can also delete the ECK StatefulSet and reapply the manifests.
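For example (a rough sketch; the URL below is the standard ECK 1.0.0-beta1 all-in-one manifest, so substitute whatever manifests you originally installed from, and note your operator runs in the elasticsearch namespace rather than the default elastic-system):

❯ kubectl apply -f https://download.elastic.co/downloads/eck/1.0.0-beta1/all-in-one.yaml

# If the StatefulSet itself looks broken, delete it first, then reapply:
❯ kubectl delete statefulset elastic-operator -n elasticsearch
❯ kubectl apply -f https://download.elastic.co/downloads/eck/1.0.0-beta1/all-in-one.yaml

Deleting the operator StatefulSet should not touch your Elasticsearch or Kibana resources; it only means no changes are reconciled until the operator is back up.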

Also, if I want to respin the cluster, is it possible to attach the old PVs to the new cluster's pods, assuming I retained the old PVs during the teardown of the old cluster? Any documentation pointer on how to achieve this would help.

We don't officially support this since it can easily be messed up. However, we did build a tool for that particular use case: https://github.com/elastic/cloud-on-k8s/tree/master/hack/reattach-pv.

If you removed an existing Elasticsearch resource, but the PVs are still around, you can recreate a new cluster and reattach the existing PVs by running the tool. Note that the new cluster must have the exact same spec (same nodeSets, same count, same config, same name, etc.).
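One prerequisite you can check up front (a sketch, not from the tool's docs): the old PVs must actually survive the teardown, so make sure their reclaim policy is Retain before deleting the old Elasticsearch resource:

# List the PVs backing the data node claims and check the RECLAIM POLICY column.
❯ kubectl get pv | grep ct-es-es-data-nodes

# If a PV is set to Delete, switch it to Retain so the volume and its data survive
# the teardown (<pv-name> is a placeholder):
❯ kubectl patch pv <pv-name> -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'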

In general, we advise relying on Elasticsearch snapshots instead.
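A minimal snapshot setup looks roughly like this (a hedged sketch; the repository name, bucket, and endpoint are placeholders, and S3 is just one of the supported repository types):

# Register a snapshot repository, then take a snapshot through the Elasticsearch API.
❯ curl -k -u "elastic:$PASSWORD" -X PUT "https://<es-endpoint>:9200/_snapshot/my_backup" \
    -H 'Content-Type: application/json' \
    -d '{"type": "s3", "settings": {"bucket": "my-es-snapshots"}}'
❯ curl -k -u "elastic:$PASSWORD" -X PUT "https://<es-endpoint>:9200/_snapshot/my_backup/snapshot_1?wait_for_completion=true"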


I have the same problem. In my Kubernetes cluster I already have one Elasticsearch cluster and it is working great.

Today I needed to create another Elasticsearch cluster, but I'm getting the same error as above, even though I use the same manifests as for the first cluster and only changed the names. When I check the role permissions, everything looks good, but the elastic operator cannot create the configmap elastic-operator-uuid.

But Elasticsearch and Kibana have phase ready and health green.

Daniel, the solution posted by @sebgl above worked perfectly for me. We need to understand the consequences of losing the ECK StatefulSet, though.