Our current production Elasticsearch cluster for log collection is manually managed and runs on AWS.
I'm creating the same cluster using ECK deployed with Helm under Terraform.
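For context, the operator itself is installed with something like the following; the release name, namespace, and everything else here are placeholders rather than my exact values:

# Install the ECK operator from the official Elastic Helm repository.
# Release name and namespace are placeholders.
resource "helm_release" "eck_operator" {
  name             = "elastic-operator"
  repository       = "https://helm.elastic.co"
  chart            = "eck-operator"
  namespace        = "elastic-system"
  create_namespace = true
}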
I was able to replicate all the features (S3 repository for snapshots, ingest pipelines, index templates, etc.) and deploy the cluster, but when I tried to update it (changing the ES version from 8.3.2 to 8.5.2) I got a NEW Elasticsearch cluster running 8.5.2, in what does not appear to be a rolling upgrade.
I can tell that it is a new cluster because the default 'elastic' superuser has a new password.
Also, when I check the Kubernetes pods immediately after the terraform apply with the updated ES version, the Kibana pod doesn't even exist (probably normal) and all the ES node pods are terminating simultaneously.
I'm not ingesting data into this new cluster at the moment, but I'm sure that if I were, I would get an ingest interruption and a red health status (or maybe not, since what I have looks like a completely new cluster...).
Most probably the problem is in my Elasticsearch manifest, but I couldn't pinpoint it.
Here is my ES manifest:
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  # copy the specified node labels as pod annotations and use them as environment variables in the Pods;
  # spreads a NodeSet across the availability zones of a Kubernetes cluster. Used for AZ awareness
  annotations:
    eck.k8s.elastic.co/downward-node-labels: "topology.kubernetes.io/zone"
  name: ${cluster_name}
  namespace: ${namespace}
spec:
  version: ${version}
  volumeClaimDeletePolicy: DeleteOnScaledown
  # updateStrategy:
  #   changeBudget:
  #     maxSurge: 1
  #     maxUnavailable: 1
  # for monitoring see: https://www.elastic.co/guide/en/cloud-on-k8s/current/k8s-stack-monitoring.html
  monitoring:
    metrics:
      elasticsearchRefs:
        - name: ${cluster_name}
    logs:
      elasticsearchRefs:
        - name: ${cluster_name}
  nodeSets:
    - name: logging-nodes
      count: ${nodes}
      config:
        # logger.org.elasticsearch: DEBUG
        node.roles: ["master", "data", "ingest", "ml", "transform", "remote_cluster_client"]
        # this allows ES to run on nodes even if their vm.max_map_count has not been increased, at a performance cost
        node.store.allow_mmap: false
        cluster:
          # name: "logging.elasticsearch" See: https://www.elastic.co/guide/en/cloud-on-k8s/current/k8s-reserved-settings.html
          routing:
            rebalance.enable: "all"
            allocation:
              enable: "all"
              allow_rebalance: "always"
              node_concurrent_recoveries: ${node_concurrent_recoveries}
        # use the zone attribute from the node labels. Used for AZ awareness; double $ is used to escape during templating
        node.attr.zone: $${ZONE}
        cluster.routing.allocation.awareness.attributes: k8s_node_name,zone
        gateway.expected_data_nodes: ${nodes}
        indices.recovery.max_bytes_per_sec: ${index_recovery_speed}
        # network.host: ["_ec2:publicDns_", "localhost"] See: https://www.elastic.co/guide/en/cloud-on-k8s/current/k8s-reserved-settings.html
        # xpack.security.enabled: true See: https://www.elastic.co/guide/en/cloud-on-k8s/current/k8s-reserved-settings.html
      podTemplate:
        metadata:
          namespace: ${namespace}
          labels:
            # additional labels for pods
            stack_name: ${stack_name}
            stack_repository: ${stack_repository}
        spec:
          volumes:
            - name: aws-iam-token-es
              projected:
                defaultMode: 420
                sources:
                  - serviceAccountToken:
                      audience: sts.amazonaws.com
                      expirationSeconds: 86400
                      path: aws-web-identity-token-file
          serviceAccountName: ${service_account}
          containers:
            - name: elasticsearch
              # specify resource limits and requests
              resources:
                limits:
                  memory: 4Gi
                  cpu: "1"
              volumeMounts:
                - mountPath: /usr/share/elasticsearch/config/repository-s3
                  name: aws-iam-token-es
                  readOnly: true
              env:
                # Make the topology.kubernetes.io/zone annotation available as an environment variable
                # and use it as a cluster routing allocation attribute.
                - name: AWS_ROLE_SESSION_NAME
                  value: elasticsearch-sts
                - name: ZONE
                  valueFrom:
                    fieldRef:
                      fieldPath: metadata.annotations['topology.kubernetes.io/zone']
          # used for availability zone awareness
          topologySpreadConstraints:
            - maxSkew: 1
              topologyKey: topology.kubernetes.io/zone
              whenUnsatisfiable: DoNotSchedule
              labelSelector:
                matchLabels:
                  elasticsearch.k8s.elastic.co/cluster-name: ${cluster_name}
                  elasticsearch.k8s.elastic.co/statefulset-name: ${cluster_name}-es-logging-nodes
      # request 15Gi of persistent data storage for pods in this topology element
      volumeClaimTemplates:
        - metadata:
            name: elasticsearch-data # Do not change this name unless you set up a volume mount for the data path.
          spec:
            accessModes:
              - ReadWriteOnce
            resources:
              requests:
                storage: 15Gi
            storageClassName: gp2
I can also post the Kibana manifest, but I don't think it is relevant.
To perform the upgrade, I just change the ${version} variable.
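For reference, the manifest is rendered and applied from Terraform roughly like this; the resource name, file path, and variable names are placeholders rather than my exact code (at this stage I was still using the gavinbunney/kubectl provider, which comes up below):

# Render the templated manifest and apply it with the gavinbunney/kubectl provider.
# File path and variable names are placeholders.
resource "kubectl_manifest" "elasticsearch_deploy" {
  yaml_body = templatefile("${path.module}/manifests/elasticsearch.yaml.tpl", {
    cluster_name               = var.cluster_name
    namespace                  = var.namespace
    version                    = var.es_version # bumping this from 8.3.2 to 8.5.2 is the only change for the upgrade
    nodes                      = var.nodes
    node_concurrent_recoveries = var.node_concurrent_recoveries
    index_recovery_speed       = var.index_recovery_speed
    stack_name                 = var.stack_name
    stack_repository           = var.stack_repository
    service_account            = var.service_account
  })
}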
Thanks for jumping back here.
Before seeing your post I was using gavinbunney/kubectl, but your post 'inspired me' to give the 'official' kubernetes_manifest another try.
Now, when I apply the Elasticsearch version change I get this error:
│ The API returned the following conflict: "Apply failed with 1 conflict: conflict with \"elastic-operator\" using elasticsearch.k8s.elastic.co/v1: .spec.nodeSets"
│
│ You can override this conflict by setting "force_conflicts" to true in the "field_manager" block.
I tried adding:
field_manager {
  force_conflicts = true
}
but then I got:
│ Error: Provider produced inconsistent result after apply
│
│ When applying changes to kubernetes_manifest.kibana_deploy, provider "provider[\"registry.terraform.io/hashicorp/kubernetes\"]" produced an unexpected new value: .object: wrong final value type: incorrect object attributes.
│
│ This is a bug in the provider, which should be reported in the provider's own issue tracker.
so I think this is a no-go.
But with force_conflicts removed, even though I was still getting an error, the plan phase of Terraform said it was going to update rather than replace the resources, so that was a step closer.
In the end the problem was that the ECK operator makes a lot of changes in the spec section, so if you add the whole "spec" to computed_fields those changes are ignored and the upgrade proceeds as intended.
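A sketch of what that ended up looking like (resource name, file path, and variable names are placeholders):

# Same rendered template, but applied with the official kubernetes_manifest resource.
# Adding "spec" to computed_fields tells the provider to ignore the changes the
# elastic-operator makes to .spec, so the plan updates the resource in place
# instead of conflicting with (or replacing) it.
resource "kubernetes_manifest" "elasticsearch_deploy" {
  manifest = yamldecode(templatefile("${path.module}/manifests/elasticsearch.yaml.tpl", {
    cluster_name               = var.cluster_name
    namespace                  = var.namespace
    version                    = var.es_version
    nodes                      = var.nodes
    node_concurrent_recoveries = var.node_concurrent_recoveries
    index_recovery_speed       = var.index_recovery_speed
    stack_name                 = var.stack_name
    stack_repository           = var.stack_repository
    service_account            = var.service_account
  }))

  # "metadata.labels" and "metadata.annotations" are the provider defaults;
  # "spec" is the addition that resolves the conflict with the operator.
  computed_fields = ["metadata.labels", "metadata.annotations", "spec"]
}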