Elastic Operator 2.6.1 Failing Reconciliation with "Invalid Settings" Error on ES 8.6.2 in AKS, Without Specifying Which Setting Is Invalid

Hi - We are trying to upgrade our Elastic operator from 1.9 to 2.6.1, and subsequently our Elasticsearch cluster, deployed in K8s and managed by the operator, from 7.17 to 8.6.2.
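For context, the "CRD version" change here is just the spec.version field on the Elasticsearch resource. A minimal sketch of what we apply (the cluster name and namespace below are placeholders, not our real ones, and the nodeSets are trimmed - the real cluster has master, coordinator, and data sets):

# Apply the version bump on the Elasticsearch resource.
# "demo-elastic" and the single nodeSet are placeholders for illustration only.
kubectl apply -n demo-elastic -f - <<'EOF'
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: api
spec:
  version: 8.6.2          # the field we changed from 7.17.x
  nodeSets:
    - name: master
      count: 3
      config:
        node.roles: ["master"]
EOF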

Upgrading the operator to 2.6.1 initially posed no issues, and the same was true when we upgraded our underlying ES image to 8.6.2. However, when we updated the Elasticsearch CRD to have version 8.6.2 in its spec and applied that change to k8s, the operator started erroring out on its reconciliation run with the following error:

{"log.level":"error","@timestamp":"2023-03-06T20:54:38.612Z","log.logger":"manager.eck-operator","message":"Reconciler error","service.version":"2.6.1+62f2e278","service.type":"eck","ecs.version":"1.4.0","controller":"elasticsearch-controller","object":{"name":"api","namespace":"qa-xxxxx-elastic"},"namespace":"qa-xxxx-elastic","name":"api","reconcileID":"0f3ec496-5582-4c4a-85ff-0a320c29e171","error":"elasticsearch client failed for https://api-es-internal-http.qa-xxxxx-elastic.svc:9200/_internal/desired_nodes/816e471c-4b7c-4322-8e25-9c52e8fbdc82/1: 400 Bad **Request: {Status:400 Error:{CausedBy:{Reason: Type:} Reason:Nodes with ids [api-es-master-0,api-es-master-1,api-es-master-2,api-es-coordinator-0,api-es-coordinator-1,api-es-data-0,api-es-data-1,api-es-data-2] in positions [0,1,2,3,4,5,6,7] contain invalid settings Type:illegal_argument_exception StackTrace: RootCause:[{Reason:Nodes with ids [api-es-master-0,api-es-master-1,api-es-master-2,api-es-coordinator-0,api-es-coordinator-1,api-es-data-0,api-es-data-1,api-es-data-2] in positions [0,1,2,3,4,5,6,7] contain invalid settings Type:illegal_argument_exception}]}}**","errorCauses":[{"error":"elasticsearch client failed for https://api-es-internal-http.qa-xxxxxx-elastic.svc:9200/_internal/desired_nodes/816e471c-4b7c-4322-8e25-9c52e8fbdc82/1: 400 Bad Request: {Status:400 Error:{CausedBy:{Reason: Type:} Reason:Nodes with ids [api-es-master-0,api-es-master-1,api-es-master-2,api-es-coordinator-0,api-es-coordinator-1,api-es-data-0,api-es-data-1,api-es-data-2] in positions [0,1,2,3,4,5,6,7] contain invalid settings Type:illegal_argument_exception StackTrace: RootCause:[{Reason:Nodes with ids [api-es-master-0,api-es-master-1,api-es-master-2,api-es-coordinator-0,api-es-coordinator-1,api-es-data-0,api-es-data-1,api-es-data-2] in positions [0,1,2,3,4,5,6,7] contain invalid settings Type:illegal_argument_exception}]}}"}],"error.stack_trace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.1/pkg/internal/controller/controller.go:326\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.1/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.1/pkg/internal/controller/controller.go:234"}

Operator Version: 2.6.1
Underlying Elastic Image Version: 8.6.2
Elastic Cluster State: Green
Elastic 'Phase': Stuck in Applying Changes

This issue did not occur when we updated the underlying ES image to 8.6.2, only when the CRD 'version' was updated. The Elasticsearch cluster is also still in a 'green' state and working just fine. The only problem is that once it gets stuck in this 'ApplyingChanges' phase on its CRD, it becomes unresponsive to any CRD change and unmanageable from that end. A couple of our clusters have eventually exited this state through no change of our own, but the majority in our test environment are still stuck in it while we test out this upgrade.
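(For anyone reproducing this, the phase comes straight from the resource status; a quick way to watch it, again with a placeholder namespace:)

# Watch the phase ECK reports on the Elasticsearch resource.
# "demo-elastic" is a placeholder namespace, not our real one.
kubectl get elasticsearch api -n demo-elastic
# Illustrative output while stuck:
# NAME   HEALTH   NODES   VERSION   PHASE             AGE
# api    green    8       8.6.2     ApplyingChanges   41d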

The error message being thrown by the operator is very confusing and doesn't actually specify which invalid setting is causing the issue.

Any guidance? How can we better troubleshoot the error message coming from the operator? And how can the settings be invalid if the cluster/CRD is in a green state, open, and serving traffic?
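For what it's worth, here is roughly how we've been poking at it so far. The service and secret names follow the standard ECK conventions for a cluster named 'api'; the namespace is a placeholder, and _internal/desired_nodes is an internal API with no stability guarantees:

# Grab the elastic user's password from the secret ECK creates ("<cluster>-es-elastic-user").
PW=$(kubectl get secret api-es-elastic-user -n demo-elastic -o go-template='{{.data.elastic | base64decode}}')

# Port-forward the cluster's HTTP service locally.
kubectl port-forward service/api-es-http 9200 -n demo-elastic &

# Ask ES directly for the latest desired-nodes history it accepted.
# This is an internal API, so treat it as debugging only.
curl -sk -u "elastic:$PW" "https://localhost:9200/_internal/desired_nodes/_latest"

# The elected master's logs sometimes carry more detail on what was rejected.
kubectl logs api-es-master-0 -n demo-elastic | grep -i "desired"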

Small update on this - I was able to resolve the error by setting my Elasticsearch CRD version to 8.2.3. This successfully spun up our clusters running an ES 8.6.2 base image with operator 2.6.1.
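(Concretely, the workaround was just pinning spec.version back down, e.g. via a patch like this, with a placeholder namespace:)

# Pin the declared version to 8.2.3 while the pods keep running the 8.6.2 image.
# "demo-elastic" is a placeholder namespace.
kubectl patch elasticsearch api -n demo-elastic --type merge \
  -p '{"spec":{"version":"8.2.3"}}'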

Any CRD version higher than 8.2.3 forced the error, including 8.3.3, 8.5.1, and 8.6.2.

No clue as to the reason for this, but I was able to pinpoint the configuration change that was forcing the error. Hopefully it isn't too big of a deal to set the CRD to a different version number than the underlying image version? Haven't seen any issues yet.

