Upgrade ES version managed by ECK on Terraform - all pods terminated

Hello everyone,

Our current production Elasticsearch cluster for log collection is manually managed and runs on AWS.
I'm recreating the same cluster using ECK, deployed with Helm under Terraform.
I was able to replicate all the features (S3 repository for snapshots, ingest pipelines, index templates, etc.) and deploy the cluster, but when I tried to update it (changing the ES version from 8.3.2 to 8.5.2) I got a NEW Elasticsearch cluster on version 8.5.2, in what does not look like a rolling upgrade at all.

I can tell that it is a new cluster because the default 'elastic' superuser has a new password.

Also, when I check the Kubernetes pods immediately after the terraform apply with the updated ES version, the Kibana pod doesn't even exist (probably normal) and all the ES node pods are terminating simultaneously.
I'm not ingesting data into this new cluster at the moment, but I'm sure that if I were, I would get an ingest interruption and a red health status (or maybe not, since what I have looks like a completely new cluster...).

Most probably the problem is in my Elasticsearch manifest, but I couldn't pinpoint it.

Here is my ES manifest:

apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  # copy the specified node labels onto the Pods as annotations so they can be exposed as environment variables; used to spread the NodeSet across the availability zones of the Kubernetes cluster (AZ awareness)
  annotations:
    eck.k8s.elastic.co/downward-node-labels: "topology.kubernetes.io/zone"
  name: ${cluster_name}
  namespace: ${namespace}
spec:
  version: ${version}
  volumeClaimDeletePolicy: DeleteOnScaledown
  #updateStrategy:
  #  changeBudget:
  #    maxSurge: 1
  #    maxUnavailable: 1
  # for monitoring see: https://www.elastic.co/guide/en/cloud-on-k8s/current/k8s-stack-monitoring.html
  monitoring:
    metrics:
      elasticsearchRefs:
        - name: ${cluster_name}
    logs:
      elasticsearchRefs:
        - name: ${cluster_name}
  nodeSets:
    - name: logging-nodes
      count: ${nodes}
      config:
        # logger.org.elasticsearch: DEBUG
        node.roles: ["master","data", "ingest", "ml", "transform", "remote_cluster_client"]
        # this allows ES to run on nodes even if their vm.max_map_count has not been increased, at a performance cost
        node.store.allow_mmap: false
        cluster:
        # name: "logging.elasticsearch" See: https://www.elastic.co/guide/en/cloud-on-k8s/current/k8s-reserved-settings.html
          routing:
            rebalance.enable: "all"
            allocation:
              enable: "all"
              allow_rebalance: "always"
              node_concurrent_recoveries: ${node_concurrent_recoveries}
        # use the zone attribute from the node labels. Used for AZ awareness; double $ is used to escape during templating
        node.attr.zone: $${ZONE}
        cluster.routing.allocation.awareness.attributes: k8s_node_name,zone
        gateway.expected_data_nodes: ${nodes}
        indices.recovery.max_bytes_per_sec: ${index_recovery_speed}
        # network.host: ["_ec2:publicDns_", "localhost"]    See: https://www.elastic.co/guide/en/cloud-on-k8s/current/k8s-reserved-settings.html
        # xpack.security.enabled: true     See: https://www.elastic.co/guide/en/cloud-on-k8s/current/k8s-reserved-settings.html
      podTemplate:
        metadata:
          namespace: ${namespace}
          labels:
            # additional labels for pods
            stack_name: ${stack_name}
            stack_repository: ${stack_repository}
        spec:
          volumes:
            - name: aws-iam-token-es
              projected:
                defaultMode: 420
                sources:
                - serviceAccountToken:
                    audience: sts.amazonaws.com
                    expirationSeconds: 86400
                    path: aws-web-identity-token-file
          serviceAccountName: ${service_account}
          containers:
            - name: elasticsearch
              # specify resource limits (requests default to the limits when omitted)
              resources:
                limits:
                  memory: 4Gi
                  cpu: "1"
              volumeMounts:
              - mountPath: /usr/share/elasticsearch/config/repository-s3
                name: aws-iam-token-es
                readOnly: true
              env:
                - name: AWS_ROLE_SESSION_NAME
                  value: elasticsearch-sts
                # Make the topology.kubernetes.io/zone annotation available as an environment variable
                # and use it as a cluster routing allocation attribute.
                - name: ZONE
                  valueFrom:
                    fieldRef:
                      fieldPath: metadata.annotations['topology.kubernetes.io/zone']
          # used for availability zone awareness
          topologySpreadConstraints:
            - maxSkew: 1
              topologyKey: topology.kubernetes.io/zone
              whenUnsatisfiable: DoNotSchedule
              labelSelector:
                matchLabels:
                  elasticsearch.k8s.elastic.co/cluster-name: ${cluster_name}
                  elasticsearch.k8s.elastic.co/statefulset-name: ${cluster_name}-es-logging-nodes # ECK names the StatefulSet <cluster>-es-<nodeSet name>
      # request 15Gi of persistent data storage for pods in this topology element
      volumeClaimTemplates:
        - metadata:
            name: elasticsearch-data # Do not change this name unless you set up a volume mount for the data path.
          spec:
            accessModes:
              - ReadWriteOnce
            resources:
              requests:
                storage: 15Gi
            storageClassName: gp2
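
For reference, this template is rendered and applied from Terraform with something like the following (a simplified sketch, not my exact code; at this point I was still using the gavinbunney/kubectl provider, as I mention further down, and only a few of the template variables are shown):

resource "kubectl_manifest" "elasticsearch_deploy" {
  # render the templated ES manifest above and hand the resulting YAML to the kubectl provider
  yaml_body = templatefile("config/elasticsearch.yaml", {
    version      = var.elastic_stack_version
    nodes        = var.logging_elasticsearch_nodes_count
    cluster_name = local.cluster_name
    # ...plus namespace, service_account and the other variables referenced in the template
  })
}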

I can also post the kibana manifest but I don't think it is relevant.
To perform the upgrade, I just change the ${version} variable.
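
Concretely, the bump is just a change to the Terraform input variable that feeds the template, something like this (the variable name matches the resource shown further down; the description and default here are illustrative):

variable "elastic_stack_version" {
  description = "Elastic Stack version rendered into the Elasticsearch and Kibana manifests"
  type        = string
  default     = "8.3.2" # changing this to "8.5.2" is the only change I make for the upgrade
}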

I think I'm having the same problem as the one described in Deploy Elasticsearch Custom Resource with Terraform.

I was never able to find a solution to this.

I did come across this post: Problem with preventing deletion of elastic volumeClaimTemplate created by terraform - Kubernetes - HashiCorp Discuss, which I believe is related, but it doesn't offer a solution either.

Thanks for jumping back in here.
Before seeing your post I was using gavinbunney/kubectl, but your post 'inspired me' to give the 'official' kubernetes_manifest another try.
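
The switch itself is mostly a matter of pointing required_providers at the official provider and rewriting the resource from kubectl_manifest to kubernetes_manifest, roughly like this (the version constraint is illustrative):

terraform {
  required_providers {
    # previously: gavinbunney/kubectl and its kubectl_manifest resource
    kubernetes = {
      source  = "hashicorp/kubernetes"
      version = ">= 2.16" # any recent release that supports kubernetes_manifest
    }
  }
}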

Now, when I apply the Elasticsearch version change, I get this error:

│ The API returned the following conflict: "Apply failed with 1 conflict: conflict with \"elastic-operator\" using elasticsearch.k8s.elastic.co/v1: .spec.nodeSets"
│ 
│ You can override this conflict by setting "force_conflicts" to true in the "field_manager" block.

I tried to add the

  field_manager {
    force_conflicts = true
  }

but then I got:

│ Error: Provider produced inconsistent result after apply
│ 
│ When applying changes to kubernetes_manifest.kibana_deploy, provider "provider[\"registry.terraform.io/hashicorp/kubernetes\"]" produced an unexpected new value: .object: wrong final value type: incorrect object attributes.
│ 
│ This is a bug in the provider, which should be reported in the provider's own issue tracker.

so I think this is a no-go.

But with force_conflicts taken out, even though the apply still errored, the plan phase of Terraform said it was going to update the resources rather than replace them (in the plan output, ~ means update in-place while -/+ means destroy and recreate), so it is a step closer.

In the kubernetes_manifest resource I used:

computed_fields = ["metadata.labels", "metadata.annotations","spec.finalizers","status"]

which I found in another post.
Maybe if we could find the complete list of required computed_fields it would work.

Researching.....

This is where I found it:


In the end, the problem was that the ECK operator makes a lot of changes in the spec section, so if you add the whole "spec" to computed_fields, those changes are ignored and the upgrade proceeds as intended:

resource "kubernetes_manifest" "elasticsearch_deploy" {
  field_manager {
    force_conflicts = true
  }
  computed_fields = ["metadata.labels", "metadata.annotations", "spec", "status"]
  manifest = yamldecode(templatefile("config/elasticsearch.yaml", {
    version                    = var.elastic_stack_version
    nodes                      = var.logging_elasticsearch_nodes_count
    cluster_name               = local.cluster_name
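    # the elasticsearch.yaml template above also expects namespace, node_concurrent_recoveries,
    # index_recovery_speed, stack_name, stack_repository and service_account in this map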
  }))
}
