Today I faced an unexpected issue with one of our production clusters. I was alerted that the cluster was unavailable and returning authorization errors during writes. When I looked at the pods running on our Kubernetes cluster, I noticed that all Elasticsearch master and data node pods had been restarted recently. Looking at the Elasticsearch resource itself, I saw that the cluster health was unknown and the master nodes hadn't even started yet.
When I looked at the secrets, I also noticed that all the ECK-managed secrets had been re-created, as had the Persistent Volume Claims. So the cluster had effectively been wiped clean, but it wasn't being properly initialized for some reason. Looking at the master node logs, I found the following error mentioned a lot:
{"type": "server", "timestamp": "2020-09-24T07:35:53,694Z", "level": "WARN", "component": "o.e.c.c.ClusterFormationFailureHelper", "cluster.name": "project-elastic", "node.name": "project-elastic-es-masters-0", "message": "master not discovered yet, this node has not previously joined a bootstrapped (v7+) cluster, and [cluster.initial_master_nodes] is empty on this node: have discovered [{project-elastic-es-masters-0}{qNwxDj1dSiufubGB1YqVdw}{49NKxsCjQ_GxDoZmTNW3jQ}{172.19.12.7}{172.19.12.7:9300}{lmr}{ml.machine_memory=12884901888, xpack.installed=true, transform.node=false, ml.max_open_jobs=20}, {project-elastic-es-masters-1}{986ImUXYQY-Wio6TUxiz1w}{v_1LWCl-TSOIMZx_kmY3mA}{172.19.11.7}{172.19.11.7:9300}{lmr}{ml.machine_memory=12884901888, ml.max_open_jobs=20, xpack.installed=true, transform.node=false}]; discovery will continue using [127.0.0.1:9300, 127.0.0.1:9301, 127.0.0.1:9302, 127.0.0.1:9303, 127.0.0.1:9304, 127.0.0.1:9305, 172.19.11.7:9300] from hosts providers and [{project-elastic-es-masters-0}{qNwxDj1dSiufubGB1YqVdw}{49NKxsCjQ_GxDoZmTNW3jQ}{172.19.12.7}{172.19.12.7:9300}{lmr}{ml.machine_memory=12884901888, xpack.installed=true, transform.node=false, ml.max_open_jobs=20}] from last-known cluster state; node term 0, last-accepted version 0 in term 0" }
To remedy the situation, I scaled down the StatefulSets for the master and data nodes and re-created the persistent volume claims by hand to point back to the disks previously used by this cluster. Once I managed that, restarting the pods brought the cluster back online. The only noticeable change was that the elastic user now had a new password.
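Roughly, the manual recovery looked like the sketch below. The PV name is a placeholder (the real names come from kubectl get pv, where the retained disks showed up as Released with their old claimRef still set), and the exact steps are reconstructed from memory:

# Scale the ECK-managed StatefulSets down so nothing holds the volumes while juggling claims
kubectl -n project scale statefulset project-elastic-es-masters project-elastic-es-data-new --replicas=0

# A Released PV with a Retain reclaim policy keeps a stale claimRef; clearing it lets a new claim bind
kubectl patch pv <retained-pv-name> --type merge -p '{"spec":{"claimRef":null}}'

# Re-create each claim with the exact name the StatefulSet expects, pinned to the old disk
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: elasticsearch-data-project-elastic-es-masters-0
  namespace: project
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: managed-premium-retain
  volumeName: <retained-pv-name>  # placeholder for the old disk's PV name
  resources:
    requests:
      storage: 64Gi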
After I got the cluster back online and verified that everything was working again, I started looking through the ECK operator logs to see if I could spot any clear signs of what had happened. The first line I could find pointing to a possible "smoking gun" was this:
{"log.level":"info","@timestamp":"2020-09-24T04:05:34.964Z","log.logger":"transport","message":"Certificate was not valid, should issue new: x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"project-elastic-transport\")","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","namespace":"project","subject":"CN=project-elastic-es-data-new-4.node.project-elastic.project.es.local,OU=project-elastic","issuer":"CN=project-elastic-transport,OU=project-elastic","current_ca_subject":"CN=project-elastic-transport,OU=project-elastic","pod":"project-elastic-es-data-new-4"}
This certificate warning was logged for all the data and master nodes of the cluster. Afterwards I can see log lines about all the resources being re-created:
{"log.level":"info","@timestamp":"2020-09-24T04:05:34.777Z","log.logger":"generic-reconciler","message":"Creating resource","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","kind":"ConfigMap","namespace":"project","name":"project-elastic-es-scripts"}
... truncating this to fit within the post size limit, but basically every secret, ConfigMap, service and StatefulSet gets created
And then there are some repeating reconciler errors because the cluster isn't responding or the service isn't available. So while the immediate issue is fixed, I'd like to understand what caused this and what I can do to prevent it from happening in the future. Here's my Elasticsearch definition for the cluster:
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  annotations:
    common.k8s.elastic.co/controller-version: 1.2.1
    elasticsearch.k8s.elastic.co/cluster-uuid: fAQD7UJvSxOOnPTGBNVbFw
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"elasticsearch.k8s.elastic.co/v1","kind":"Elasticsearch","metadata":{"annotations":{},"name":"project-elastic","namespace":"project"},"spec":{"http":{"tls":{"selfSignedCertificate":{"subjectAltNames":[{"dns":"project-elastic-es-data-0.project.svc"},{"dns":"project-elastic-es-data-1.project.svc"},{"dns":"project-elastic-es-data-2.project.svc"},{"dns":"project-elastic-es-data-3.project.svc"},{"dns":"project-elastic-es-data-4.project.svc"},{"dns":"project-elastic-es-masters-0.project.svc"},{"dns":"project-elastic-es-masters-1.project.svc"},{"dns":"project-elastic-es-masters-2.project.svc"},{"dns":"project-elastic-es-http.project.svc.cluster.local"}]}}},"nodeSets":[{"config":{"indices.memory.index_buffer_size":"30%","node.data":false,"node.ingest":false,"node.master":true},"count":3,"name":"masters","podTemplate":{"spec":{"containers":[{"env":[{"name":"ES_JAVA_OPTS","value":"-Xms6g -Xmx6g"}],"name":"elasticsearch","resources":{"limits":{"cpu":2,"memory":"12Gi"},"requests":{"cpu":0.25,"memory":"12Gi"}}}],"initContainers":[{"command":["sh","-c","sysctl -w vm.max_map_count=262144"],"name":"sysctl","securityContext":{"privileged":true}},{"command":["sh","-c","bin/elasticsearch-plugin install --batch repository-azure\n"],"name":"install-plugins"}],"nodeSelector":{"agentpool":"projectes"},"tolerations":[{"effect":"NoSchedule","key":"dedicated","operator":"Equal","value":"elasticsearch"}]}},"volumeClaimTemplates":[{"metadata":{"name":"elasticsearch-data"},"spec":{"accessModes":["ReadWriteOnce"],"resources":{"requests":{"storage":"64Gi"}},"storageClassName":"managed-premium-retain"}}]},{"config":{"indices.memory.index_buffer_size":"30%","node.data":true,"node.ingest":true,"node.master":false},"count":5,"name":"data-new","podTemplate":{"spec":{"containers":[{"env":[{"name":"ES_JAVA_OPTS","value":"-Xms8g -Xmx8g"}],"name":"elasticsearch","resources":{"limits":{"cpu":4,"memory":"16Gi"},"requests":{"cpu":2,"memory":"16Gi"}}}],"initContainers":[{"command":["sh","-c","sysctl -w vm.max_map_count=262144"],"name":"sysctl","securityContext":{"privileged":true}},{"command":["sh","-c","bin/elasticsearch-plugin install --batch repository-azure\n"],"name":"install-plugins"}],"nodeSelector":{"agentpool":"projectes"},"tolerations":[{"effect":"NoSchedule","key":"dedicated","operator":"Equal","value":"elasticsearch"}]}},"volumeClaimTemplates":[{"metadata":{"name":"elasticsearch-data"},"spec":{"accessModes":["ReadWriteOnce"],"resources":{"requests":{"storage":"1024Gi"}},"storageClassName":"managed-premium-retain"}}]}],"secureSettings":[{"secretName":"project-es-backup-storage-account"},{"secretName":"project-es-backup-storage-key"}],"version":"7.7.0"}}
  creationTimestamp: "2020-03-10T11:29:23Z"
  generation: 105
  name: project-elastic
  namespace: project
  resourceVersion: "63150509"
  selfLink: /apis/elasticsearch.k8s.elastic.co/v1/namespaces/project/elasticsearches/project-elastic
  uid: 969cc88e-de07-4906-a4d8-4f6b00ca47ec
spec:
  auth: {}
  http:
    service:
      metadata:
        creationTimestamp: null
      spec: {}
    tls:
      certificate: {}
      selfSignedCertificate:
        subjectAltNames:
        - dns: project-elastic-es-data-0.project.svc
        - dns: project-elastic-es-data-1.project.svc
        - dns: project-elastic-es-data-2.project.svc
        - dns: project-elastic-es-data-3.project.svc
        - dns: project-elastic-es-data-4.project.svc
        - dns: project-elastic-es-masters-0.project.svc
        - dns: project-elastic-es-masters-1.project.svc
        - dns: project-elastic-es-masters-2.project.svc
        - dns: project-elastic-es-http.project.svc.cluster.local
  nodeSets:
  - config:
      indices.memory.index_buffer_size: 30%
      node.data: false
      node.ingest: false
      node.master: true
    count: 3
    name: masters
    podTemplate:
      spec:
        containers:
        - env:
          - name: ES_JAVA_OPTS
            value: -Xms6g -Xmx6g
          name: elasticsearch
          resources:
            limits:
              cpu: 2
              memory: 12Gi
            requests:
              cpu: 0.25
              memory: 12Gi
        initContainers:
        - command:
          - sh
          - -c
          - sysctl -w vm.max_map_count=262144
          name: sysctl
          securityContext:
            privileged: true
        - command:
          - sh
          - -c
          - |
            bin/elasticsearch-plugin install --batch repository-azure
          name: install-plugins
        nodeSelector:
          agentpool: projectes
        tolerations:
        - effect: NoSchedule
          key: dedicated
          operator: Equal
          value: elasticsearch
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 64Gi
        storageClassName: managed-premium-retain
  - config:
      indices.memory.index_buffer_size: 30%
      node.data: true
      node.ingest: true
      node.master: false
    count: 5
    name: data-new
    podTemplate:
      spec:
        containers:
        - env:
          - name: ES_JAVA_OPTS
            value: -Xms8g -Xmx8g
          name: elasticsearch
          resources:
            limits:
              cpu: 4
              memory: 16Gi
            requests:
              cpu: 2
              memory: 16Gi
        initContainers:
        - command:
          - sh
          - -c
          - sysctl -w vm.max_map_count=262144
          name: sysctl
          securityContext:
            privileged: true
        - command:
          - sh
          - -c
          - |
            bin/elasticsearch-plugin install --batch repository-azure
          name: install-plugins
        nodeSelector:
          agentpool: projectes
        tolerations:
        - effect: NoSchedule
          key: dedicated
          operator: Equal
          value: elasticsearch
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 1024Gi
        storageClassName: managed-premium-retain
  secureSettings:
  - secretName: project-es-backup-storage-account
  - secretName: project-es-backup-storage-key
  transport:
    service:
      metadata:
        creationTimestamp: null
      spec: {}
  updateStrategy:
    changeBudget: {}
  version: 7.7.0
status:
  availableNodes: 8
  health: yellow
  phase: Ready
We recently upgraded this cluster to run on Kubernetes 1.17.9, but even the last node pool for this cluster was upgraded more than 20 hours before these issues cropped up, so I'm not sure whether that's related. We are running ECK version 1.2.1 and the cluster is provisioned with Azure Kubernetes Service.