ECK managed cluster was re-created unexpectedly

Today I ran into an unexpected issue with one of our production clusters. I was alerted that the cluster was unavailable and returning authorization errors on writes. When I looked at the pods running on our Kubernetes cluster, I noticed that all Elasticsearch master and data node pods had been restarted recently. Looking at the Elasticsearch resource itself, the cluster health was unknown and none of the master nodes had even started yet.
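
For context, these are roughly the commands I was poking around with (resource names as in the manifest further down):

# pod restart counts and ages for the cluster's nodes
kubectl get pods -n project -l elasticsearch.k8s.elastic.co/cluster-name=project-elastic
# health and phase as reported on the Elasticsearch resource
kubectl get elasticsearch project-elastic -n project
# creation timestamps show whether the ECK-managed secrets and claims are new
kubectl get secrets -n project --sort-by=.metadata.creationTimestamp
kubectl get pvc -n project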

When I looked at the secrets, I also noticed that all the ECK-managed secrets had been re-created, as had the Persistent Volume Claims. So the cluster had effectively been wiped clean, but for some reason it wasn't being properly initialized. Looking at the master node logs, I found the following error repeated over and over:

{"type": "server", "timestamp": "2020-09-24T07:35:53,694Z", "level": "WARN", "component": "o.e.c.c.ClusterFormationFailureHelper", "cluster.name": "project-elastic", "node.name": "project-elastic-es-masters-0", "message": "master not discovered yet, this node has not previously joined a bootstrapped (v7+) cluster, and [cluster.initial_master_nodes] is empty on this node: have discovered [{project-elastic-es-masters-0}{qNwxDj1dSiufubGB1YqVdw}{49NKxsCjQ_GxDoZmTNW3jQ}{172.19.12.7}{172.19.12.7:9300}{lmr}{ml.machine_memory=12884901888, xpack.installed=true, transform.node=false, ml.max_open_jobs=20}, {project-elastic-es-masters-1}{986ImUXYQY-Wio6TUxiz1w}{v_1LWCl-TSOIMZx_kmY3mA}{172.19.11.7}{172.19.11.7:9300}{lmr}{ml.machine_memory=12884901888, ml.max_open_jobs=20, xpack.installed=true, transform.node=false}]; discovery will continue using [127.0.0.1:9300, 127.0.0.1:9301, 127.0.0.1:9302, 127.0.0.1:9303, 127.0.0.1:9304, 127.0.0.1:9305, 172.19.11.7:9300] from hosts providers and [{project-elastic-es-masters-0}{qNwxDj1dSiufubGB1YqVdw}{49NKxsCjQ_GxDoZmTNW3jQ}{172.19.12.7}{172.19.12.7:9300}{lmr}{ml.machine_memory=12884901888, xpack.installed=true, transform.node=false, ml.max_open_jobs=20}] from last-known cluster state; node term 0, last-accepted version 0 in term 0" }

To remedy the situation, I scaled down the StatefulSets for the master and data nodes and re-created the Persistent Volume Claims by hand, pointing them back to the disks previously used by this cluster. Once that was done, restarting the pods brought the cluster back online. The only noticeable change was that the elastic user now had a new password.
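
In case it helps someone else, here is a rough sketch of what that manual recovery looked like on my side. First, scaling the ECK-managed StatefulSets down:

kubectl scale statefulset project-elastic-es-masters --replicas=0 -n project
kubectl scale statefulset project-elastic-es-data-new --replicas=0 -n project

Then, for each node, a PV/PVC pair bound to the old Azure disk. The claim names follow ECK's <claim-template>-<pod-name> convention, but the disk name and URI below are placeholders rather than the real values:

# PV pointing at the existing managed disk that still holds the node's data
apiVersion: v1
kind: PersistentVolume
metadata:
  name: elasticsearch-data-project-elastic-es-data-new-0
spec:
  capacity:
    storage: 1024Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: managed-premium-retain
  azureDisk:
    kind: Managed
    diskName: <old-disk-name>
    diskURI: /subscriptions/<subscription>/resourceGroups/<resource-group>/providers/Microsoft.Compute/disks/<old-disk-name>
---
# PVC with the name the StatefulSet pod expects, bound to the PV above
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: elasticsearch-data-project-elastic-es-data-new-0
  namespace: project
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: managed-premium-retain
  volumeName: elasticsearch-data-project-elastic-es-data-new-0
  resources:
    requests:
      storage: 1024Gi

I repeated the PV/PVC pair for every master and data node, matching each claim to the disk its pod had been using before, and then let the pods come back up.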

After I got the cluster back online and verified that everything was working again, I started looking through the ECK operator logs to see if I could find any clear sign of what had happened. The first line that pointed to a possible "smoking gun" was this:

{"log.level":"info","@timestamp":"2020-09-24T04:05:34.964Z","log.logger":"transport","message":"Certificate was not valid, should issue new: x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"project-elastic-transport\")","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","namespace":"project","subject":"CN=project-elastic-es-data-new-4.node.project-elastic.project.es.local,OU=project-elastic","issuer":"CN=project-elastic-transport,OU=project-elastic","current_ca_subject":"CN=project-elastic-transport,OU=project-elastic","pod":"project-elastic-es-data-new-4"}

This was logged for every data and master node in the cluster. After that, I can see log lines about all the resources being re-created:

{"log.level":"info","@timestamp":"2020-09-24T04:05:34.777Z","log.logger":"generic-reconciler","message":"Creating resource","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","kind":"ConfigMap","namespace":"project","name":"project-elastic-es-scripts"}
... truncated to fit the post size limit, but essentially every secret, ConfigMap, Service and StatefulSet gets created
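
For reference, these entries come from the operator pod's logs; I pulled and filtered them with something along these lines (this assumes the operator is deployed as the standard elastic-operator StatefulSet in the elastic-system namespace):

kubectl logs -n elastic-system elastic-operator-0 --since=24h | grep '"namespace":"project"'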

Then there are repeated reconciler errors because the cluster isn't responding or the service isn't available. So while the immediate issue is fixed, I'd like to understand what caused this and what I can do to prevent it from happening in the future. Here's my Elasticsearch definition for the cluster:

apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  annotations:
    common.k8s.elastic.co/controller-version: 1.2.1
    elasticsearch.k8s.elastic.co/cluster-uuid: fAQD7UJvSxOOnPTGBNVbFw
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"elasticsearch.k8s.elastic.co/v1","kind":"Elasticsearch","metadata":{"annotations":{},"name":"project-elastic","namespace":"project"},"spec":{"http":{"tls":{"selfSignedCertificate":{"subjectAltNames":[{"dns":"project-elastic-es-data-0.project.svc"},{"dns":"project-elastic-es-data-1.project.svc"},{"dns":"project-elastic-es-data-2.project.svc"},{"dns":"project-elastic-es-data-3.project.svc"},{"dns":"project-elastic-es-data-4.project.svc"},{"dns":"project-elastic-es-masters-0.project.svc"},{"dns":"project-elastic-es-masters-1.project.svc"},{"dns":"project-elastic-es-masters-2.project.svc"},{"dns":"project-elastic-es-http.project.svc.cluster.local"}]}}},"nodeSets":[{"config":{"indices.memory.index_buffer_size":"30%","node.data":false,"node.ingest":false,"node.master":true},"count":3,"name":"masters","podTemplate":{"spec":{"containers":[{"env":[{"name":"ES_JAVA_OPTS","value":"-Xms6g -Xmx6g"}],"name":"elasticsearch","resources":{"limits":{"cpu":2,"memory":"12Gi"},"requests":{"cpu":0.25,"memory":"12Gi"}}}],"initContainers":[{"command":["sh","-c","sysctl -w vm.max_map_count=262144"],"name":"sysctl","securityContext":{"privileged":true}},{"command":["sh","-c","bin/elasticsearch-plugin install --batch repository-azure\n"],"name":"install-plugins"}],"nodeSelector":{"agentpool":"projectes"},"tolerations":[{"effect":"NoSchedule","key":"dedicated","operator":"Equal","value":"elasticsearch"}]}},"volumeClaimTemplates":[{"metadata":{"name":"elasticsearch-data"},"spec":{"accessModes":["ReadWriteOnce"],"resources":{"requests":{"storage":"64Gi"}},"storageClassName":"managed-premium-retain"}}]},{"config":{"indices.memory.index_buffer_size":"30%","node.data":true,"node.ingest":true,"node.master":false},"count":5,"name":"data-new","podTemplate":{"spec":{"containers":[{"env":[{"name":"ES_JAVA_OPTS","value":"-Xms8g -Xmx8g"}],"name":"elasticsearch","resources":{"limits":{"cpu":4,"memory":"16Gi"},"requests":{"cpu":2,"memory":"16Gi"}}}],"initContainers":[{"command":["sh","-c","sysctl -w vm.max_map_count=262144"],"name":"sysctl","securityContext":{"privileged":true}},{"command":["sh","-c","bin/elasticsearch-plugin install --batch repository-azure\n"],"name":"install-plugins"}],"nodeSelector":{"agentpool":"projectes"},"tolerations":[{"effect":"NoSchedule","key":"dedicated","operator":"Equal","value":"elasticsearch"}]}},"volumeClaimTemplates":[{"metadata":{"name":"elasticsearch-data"},"spec":{"accessModes":["ReadWriteOnce"],"resources":{"requests":{"storage":"1024Gi"}},"storageClassName":"managed-premium-retain"}}]}],"secureSettings":[{"secretName":"project-es-backup-storage-account"},{"secretName":"project-es-backup-storage-key"}],"version":"7.7.0"}}
  creationTimestamp: "2020-03-10T11:29:23Z"
  generation: 105
  name: project-elastic
  namespace: project
  resourceVersion: "63150509"
  selfLink: /apis/elasticsearch.k8s.elastic.co/v1/namespaces/project/elasticsearches/project-elastic
  uid: 969cc88e-de07-4906-a4d8-4f6b00ca47ec
spec:
  auth: {}
  http:
    service:
      metadata:
        creationTimestamp: null
      spec: {}
    tls:
      certificate: {}
      selfSignedCertificate:
        subjectAltNames:
        - dns: project-elastic-es-data-0.project.svc
        - dns: project-elastic-es-data-1.project.svc
        - dns: project-elastic-es-data-2.project.svc
        - dns: project-elastic-es-data-3.project.svc
        - dns: project-elastic-es-data-4.project.svc
        - dns: project-elastic-es-masters-0.project.svc
        - dns: project-elastic-es-masters-1.project.svc
        - dns: project-elastic-es-masters-2.project.svc
        - dns: project-elastic-es-http.project.svc.cluster.local
  nodeSets:
  - config:
      indices.memory.index_buffer_size: 30%
      node.data: false
      node.ingest: false
      node.master: true
    count: 3
    name: masters
    podTemplate:
      spec:
        containers:
        - env:
          - name: ES_JAVA_OPTS
            value: -Xms6g -Xmx6g
          name: elasticsearch
          resources:
            limits:
              cpu: 2
              memory: 12Gi
            requests:
              cpu: 0.25
              memory: 12Gi
        initContainers:
        - command:
          - sh
          - -c
          - sysctl -w vm.max_map_count=262144
          name: sysctl
          securityContext:
            privileged: true
        - command:
          - sh
          - -c
          - |
            bin/elasticsearch-plugin install --batch repository-azure
          name: install-plugins
        nodeSelector:
          agentpool: projectes
        tolerations:
        - effect: NoSchedule
          key: dedicated
          operator: Equal
          value: elasticsearch
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 64Gi
        storageClassName: managed-premium-retain
  - config:
      indices.memory.index_buffer_size: 30%
      node.data: true
      node.ingest: true
      node.master: false
    count: 5
    name: data-new
    podTemplate:
      spec:
        containers:
        - env:
          - name: ES_JAVA_OPTS
            value: -Xms8g -Xmx8g
          name: elasticsearch
          resources:
            limits:
              cpu: 4
              memory: 16Gi
            requests:
              cpu: 2
              memory: 16Gi
        initContainers:
        - command:
          - sh
          - -c
          - sysctl -w vm.max_map_count=262144
          name: sysctl
          securityContext:
            privileged: true
        - command:
          - sh
          - -c
          - |
            bin/elasticsearch-plugin install --batch repository-azure
          name: install-plugins
        nodeSelector:
          agentpool: projectes
        tolerations:
        - effect: NoSchedule
          key: dedicated
          operator: Equal
          value: elasticsearch
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 1024Gi
        storageClassName: managed-premium-retain
  secureSettings:
  - secretName: project-es-backup-storage-account
  - secretName: project-es-backup-storage-key
  transport:
    service:
      metadata:
        creationTimestamp: null
      spec: {}
  updateStrategy:
    changeBudget: {}
  version: 7.7.0
status:
  availableNodes: 8
  health: yellow
  phase: Ready

We recently upgraded this cluster to run on Kubernetes 1.17.9, but even the last node pool for this cluster was upgraded roughly 20 hours before these issues cropped up, so I'm not sure whether that's related. We are running ECK 1.2.1, and the cluster is provisioned with Azure Kubernetes Service.

The symptoms you describe line up with this known issue around owner references and Kubernetes garbage collection, documented in the ECK common problems guide: https://www.elastic.co/guide/en/cloud-on-k8s/1.2/k8s-common-problems.html#k8s-common-problems-owner-refs
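
If you want to confirm it, one thing to check is whether the ownerReferences on the ECK-managed secrets still point at the UID of your Elasticsearch resource, along these lines (the secret name here just follows ECK's <cluster>-es-elastic-user convention):

# the UID the owner references should be pointing at
kubectl get elasticsearch project-elastic -n project -o jsonpath='{.metadata.uid}'
# the owner references currently set on one of the ECK-managed secrets
kubectl get secret project-elastic-es-elastic-user -n project -o jsonpath='{.metadata.ownerReferences}'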

Oh wow, that is most likely it. I noticed one copied certificate secret disappearing as well. Thank you so much for pointing this out; I was feeling pretty lost as to why this all happened!