ECK Operator stuck in ApplyingChanges state after upgrade to 1.4.0

I recently upgraded to 1.4.0 and the operator is stuck in the ApplyingChanges state. I expanded a few volumes on data nodes, which caused the cluster to enter a yellow state temporarily; however, the operator stayed green the whole time. It has been over 12 hours since the upgrade to 1.4.0, and the cluster has been in a green state since then:

$ k get es
NAME     HEALTH   NODES   VERSION   PHASE             AGE
search   green    63      6.8.6     ApplyingChanges   289d

I don't see anything relevant in the logs for the operator:

{"log.level":"info","@timestamp":"2021-02-28T14:25:52.632Z","log.logger":"generic-reconciler","message":"Updating resource","service.version":"1.4.0+4aff0b98","service.type":"eck","ecs.version":"1.4.0","kind":"ConfigMap","namespace":"elastic-system","name":"elastic-licensing"}
{"log.level":"info","@timestamp":"2021-02-28T14:27:52.632Z","log.logger":"generic-reconciler","message":"Updating resource","service.version":"1.4.0+4aff0b98","service.type":"eck","ecs.version":"1.4.0","kind":"ConfigMap","namespace":"elastic-system","name":"elastic-licensing"}
{"log.level":"info","@timestamp":"2021-02-28T14:29:52.632Z","log.logger":"generic-reconciler","message":"Updating resource","service.version":"1.4.0+4aff0b98","service.type":"eck","ecs.version":"1.4.0","kind":"ConfigMap","namespace":"elastic-system","name":"elastic-licensing"}
{"log.level":"info","@timestamp":"2021-02-28T14:31:52.632Z","log.logger":"generic-reconciler","message":"Updating resource","service.version":"1.4.0+4aff0b98","service.type":"eck","ecs.version":"1.4.0","kind":"ConfigMap","namespace":"elastic-system","name":"elastic-licensing"}
{"log.level":"info","@timestamp":"2021-02-28T14:33:52.632Z","log.logger":"generic-reconciler","message":"Updating resource","service.version":"1.4.0+4aff0b98","service.type":"eck","ecs.version":"1.4.0","kind":"ConfigMap","namespace":"elastic-system","name":"elastic-licensing"}
{"log.level":"info","@timestamp":"2021-02-28T14:35:52.632Z","log.logger":"generic-reconciler","message":"Updating resource","service.version":"1.4.0+4aff0b98","service.type":"eck","ecs.version":"1.4.0","kind":"ConfigMap","namespace":"elastic-system","name":"elastic-licensing"}
{"log.level":"info","@timestamp":"2021-02-28T14:37:52.632Z","log.logger":"generic-reconciler","message":"Updating resource","service.version":"1.4.0+4aff0b98","service.type":"eck","ecs.version":"1.4.0","kind":"ConfigMap","namespace":"elastic-system","name":"elastic-licensing"}

It looks like the es cluster has an invalid annotation:

apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  annotations:
    common.k8s.elastic.co/controller-version: 0.0.0-UNKNOWN
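
To check the annotation without reading the whole manifest, something along these lines may help. This is only a sketch: the helper name controller_version is made up for illustration, and it assumes the manifest is piped in as YAML with the annotation key and value on a single line.

```shell
# Hypothetical helper: reads an Elasticsearch manifest (YAML) on stdin and
# prints the value of the controller-version annotation, if it is present.
controller_version() {
  sed -n 's|^[[:space:]]*common\.k8s\.elastic\.co/controller-version:[[:space:]]*||p'
}

# Assumed usage against a live cluster (resource "search" in the current namespace):
#   kubectl get elasticsearch search -o yaml | controller_version
```

If nothing is printed, the annotation is probably missing from the resource altogether.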

This was an upgrade from 1.2.x to 1.4.0 -- should I try downgrading to 1.3.x and then back to 1.4.0 to resolve the inconsistency, or perhaps edit the annotation to something suitable?

When I change the manifest I see the following in the log:

{"log.level":"info","@timestamp":"2021-02-28T20:00:51.333Z","log.logger":"elasticsearch-controller","message":"Starting reconciliation run","service.version":"1.4.0+4aff0b98","service.type":"eck","ecs.version":"1.4.0","iteration":3,"namespace":"default","es_name":"search"}
{"log.level":"info","@timestamp":"2021-02-28T20:00:51.333Z","log.logger":"annotation","message":"Resource was created with older version of operator, will not take action","service.version":"1.4.0+4aff0b98","service.type":"eck","ecs.version":"1.4.0","controller_version":"1.4.0","resource_controller_version":"0.0.0-UNKNOWN","namespace":"default","name":"search"}
{"log.level":"info","@timestamp":"2021-02-28T20:00:51.333Z","log.logger":"elasticsearch-controller","message":"Ending reconciliation run","service.version":"1.4.0+4aff0b98","service.type":"eck","ecs.version":"1.4.0","iteration":3,"namespace":"default","es_name":"search","took":0.000139559}

I ended up setting the common.k8s.elastic.co/controller-version annotation to 1.2.2 (the previous version I upgraded from) and it seems that the operator picked this up as "valid" and started applying requested changes.
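
For anyone wanting to try the same workaround, the annotation can probably be set with kubectl annotate. The sketch below only prints the command so it can be reviewed before running it. The helper name annotate_cmd is hypothetical, and the namespace, resource name, and version are the ones from this thread, not general recommendations.

```shell
# Hypothetical helper: prints (does not execute) a kubectl command that
# overwrites the controller-version annotation on an Elasticsearch resource.
# Arguments: namespace, resource name, version to record in the annotation.
annotate_cmd() {
  printf 'kubectl -n %s annotate elasticsearch %s common.k8s.elastic.co/controller-version=%s --overwrite\n' \
    "$1" "$2" "$3"
}

# Values taken from this thread: namespace "default", cluster "search",
# previous operator version 1.2.2.
annotate_cmd default search 1.2.2
```

The --overwrite flag is needed because the annotation already exists with the 0.0.0-UNKNOWN value. Whether hand-editing this annotation is safe in general is not confirmed here; it is simply what worked in this case.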

Did the Elasticsearch cluster already exist under 1.2.2, or did you create it only after upgrading to 1.4.0?
Also, was there a point in time when both instances of the operator were running at the same time?

The cluster already existed under 1.2.2.

There should only have been one operator running at a time. I did the following to upgrade from 1.2.2:

kubectl apply -f https://download.elastic.co/downloads/eck/1.4.0/all-in-one.yaml

Also, I have a YAML file that I apply to edit the cluster. Should I keep the annotation in that file?

That should not be necessary. As long as you use kubectl apply, the content of that file will be merged with the state on the API server, and the annotation should be preserved.

I am honestly a bit baffled by this error, given that you say there is nothing of interest in the operator logs. Can you check if you have the following log statement in the logs:

Resource was previously reconciled by incompatible controller version and missing annotation, adding annotation
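
One way to search for it is a fixed-string grep over the operator logs. This is a sketch: the helper name find_annotation_message is made up, and the kubectl invocation in the comment assumes the default all-in-one install, where the operator runs as the elastic-operator StatefulSet in the elastic-system namespace.

```shell
# Hypothetical helper: reads operator logs on stdin and prints only the lines
# containing the message quoted above (fixed-string match, no regex).
# Like grep, it returns a non-zero exit status when nothing matches.
find_annotation_message() {
  grep -F 'Resource was previously reconciled by incompatible controller version and missing annotation, adding annotation'
}

# Assumed usage:
#   kubectl -n elastic-system logs statefulset.apps/elastic-operator | find_annotation_message
```

Keep in mind that kubectl logs only covers the current container, so a message logged before an operator restart may not show up.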
