Hi - We are trying to upgrade our Elastic operator (ECK) from 1.9 to 2.6.1, and subsequently our Elasticsearch cluster, deployed in Kubernetes and managed by the operator, from 7.17 to 8.6.2.
Upgrading the operator to 2.6.1 initially posed no issues, and the same was true when we upgraded our underlying ES image to 8.6.2. But when we updated the Elasticsearch custom resource to have version 8.6.2 in its spec and applied the change to Kubernetes, the operator started erroring out on every reconciliation run with the following error:
{"log.level":"error","@timestamp":"2023-03-06T20:54:38.612Z","log.logger":"manager.eck-operator","message":"Reconciler error","service.version":"2.6.1+62f2e278","service.type":"eck","ecs.version":"1.4.0","controller":"elasticsearch-controller","object":{"name":"api","namespace":"qa-xxxxx-elastic"},"namespace":"qa-xxxx-elastic","name":"api","reconcileID":"0f3ec496-5582-4c4a-85ff-0a320c29e171","error":"elasticsearch client failed for https://api-es-internal-http.qa-xxxxx-elastic.svc:9200/_internal/desired_nodes/816e471c-4b7c-4322-8e25-9c52e8fbdc82/1: 400 Bad Request: {Status:400 Error:{CausedBy:{Reason: Type:} Reason:Nodes with ids [api-es-master-0,api-es-master-1,api-es-master-2,api-es-coordinator-0,api-es-coordinator-1,api-es-data-0,api-es-data-1,api-es-data-2] in positions [0,1,2,3,4,5,6,7] contain invalid settings Type:illegal_argument_exception StackTrace: RootCause:[{Reason:Nodes with ids [api-es-master-0,api-es-master-1,api-es-master-2,api-es-coordinator-0,api-es-coordinator-1,api-es-data-0,api-es-data-1,api-es-data-2] in positions [0,1,2,3,4,5,6,7] contain invalid settings Type:illegal_argument_exception}]}}","errorCauses":[{"error":"elasticsearch client failed for https://api-es-internal-http.qa-xxxxxx-elastic.svc:9200/_internal/desired_nodes/816e471c-4b7c-4322-8e25-9c52e8fbdc82/1: 400 Bad Request: {Status:400 Error:{CausedBy:{Reason: Type:} Reason:Nodes with ids [api-es-master-0,api-es-master-1,api-es-master-2,api-es-coordinator-0,api-es-coordinator-1,api-es-data-0,api-es-data-1,api-es-data-2] in positions [0,1,2,3,4,5,6,7] contain invalid settings Type:illegal_argument_exception}]}}"}],"error.stack_trace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.1/pkg/internal/controller/controller.go:326\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.1/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.1/pkg/internal/controller/controller.go:234"}
Operator Version: 2.6.1
Underlying Elastic Image Version: 8.6.2
Elastic Cluster State: Green
Elastic 'Phase': Stuck in ApplyingChanges
This issue did not occur when we updated the underlying ES image to 8.6.2; it only appeared once the CRD's 'version' field was updated. The Elasticsearch cluster itself is also still in a 'green' state and working just fine. The only problem is that once the resource gets stuck in this 'ApplyingChanges' phase, it becomes unresponsive to any further CRD change and unmanageable from that end. A couple of our clusters have eventually exited this state with no change on our part, but the majority in our test environment are still stuck in it while we test out this upgrade.
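For reference, the change that triggers the behavior is only the version bump in the Elasticsearch spec, roughly as below (resource name and node set names inferred from the operator log; namespace redacted as in the log, rest of the spec omitted):

```yaml
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: api
  namespace: qa-xxxxx-elastic
spec:
  version: 8.6.2   # previously 7.17; everything else left unchanged
  # nodeSets (master / coordinator / data) unchanged
```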
The error message being thrown by the operator is very confusing and doesn't actually specify which setting is invalid or why.
Any guidance? How can we better troubleshoot the error message coming from the operator? And how can the settings be invalid if the cluster is green, open, and serving traffic?