Elasticsearch cluster pods are not restarting for version upgrade

  • Upgrading from ES 7.6.1 to 7.9.1
  • Upgrading elastic-cloud-operator from 1.1.2 to 1.2.1
  • Cluster status is green with no issues

For some reason our operator is stuck on the "do_not_restart_healthy_node_if_MaxUnavailable_reached" predicate during the upgrade:
{"log.level":"info","@timestamp":"2020-09-10T01:04:04.001Z","log.logger":"zen2","message":"Ensuring no voting exclusions are set","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","namespace":"elastic-system","es_name":"elasticsearch"}
{"log.level":"info","@timestamp":"2020-09-10T01:04:04.016Z","log.logger":"migrate-data","message":"Setting routing allocation excludes","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","namespace":"elastic-system","es_name":"elasticsearch","value":"none_excluded"}
{"log.level":"info","@timestamp":"2020-09-10T01:04:04.032Z","log.logger":"driver","message":"Cannot restart some nodes for upgrade at this time","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","namespace":"elastic-system","es_name":"elasticsearch","failed_predicates":{"do_not_restart_healthy_node_if_MaxUnavailable_reached":["elasticsearch-es-data-5","elasticsearch-es-data-4","elasticsearch-es-data-3","elasticsearch-es-data-2","elasticsearch-es-masters-2","elasticsearch-es-masters-1","elasticsearch-es-masters-0"]}}

Our Elasticsearch config (short version):

apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  annotations:
    common.k8s.elastic.co/controller-version: 1.2.1
    meta.helm.sh/release-name: elasticsearch
    meta.helm.sh/release-namespace: elastic-system
  labels:
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/version: 7.9.1
    common.k8s.elastic.co/type: elasticsearch
    elasticsearch.k8s.elastic.co/cluster-name: elasticsearch
    elasticsearch.k8s.elastic.co/statefulset-name: elasticsearch-es-data
  name: elasticsearch
  namespace: elastic-system
spec:
  nodeSets:
  - config:
    count: 3
    name: masters
  - config:
    count: 6
    name: data
  updateStrategy:
    changeBudget:
      maxSurge: 1
      maxUnavailable: 0
  version: 7.9.1
status:
  availableNodes: 9
  health: green
  phase: ApplyingChanges

Our elastic-cloud-operator StatefulSet (short version):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  annotations:
    meta.helm.sh/release-name: elastic-cloud-operator
    meta.helm.sh/release-namespace: elastic-system
  labels:
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/version: 1.2.1
    control-plane: elastic-cloud-operator
  name: elastic-cloud-operator
  namespace: elastic-system
spec:
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      control-plane: elastic-cloud-operator
  serviceName: elastic-cloud-operator
  template:
    spec:
      containers:
      - args:
        - manager
        - --enable-webhook
        - --log-verbosity=0
        - --metrics-port=9090
        env:
        - name: OPERATOR_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: OPERATOR_IMAGE
          value: docker.elastic.co/eck/eck-operator:1.2.1
        - name: WEBHOOK_SECRET
          value: elastic-cloud-operator-webhook-server-cert
        image: docker.elastic.co/eck/eck-operator:1.2.1
        imagePullPolicy: IfNotPresent
  updateStrategy:
    rollingUpdate:
      partition: 0
    type: RollingUpdate
status:
  collisionCount: 0
  currentReplicas: 1
  currentRevision: elastic-cloud-operator-75ccb4b8bb
  observedGeneration: 5
  readyReplicas: 1
  replicas: 1
  updateRevision: elastic-cloud-operator-75ccb4b8bb
  updatedReplicas: 1

I think the problem here is that you are running with maxUnavailable: 0, which leaves no room for the operator to remove a node in order to upgrade it. For a rolling upgrade at least one Pod at a time needs to be taken down, so you have to set maxUnavailable to at least 1 for the upgrade to go through. See also https://www.elastic.co/guide/en/cloud-on-k8s/1.2/k8s-update-strategy.html
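For reference, a minimal sketch of what the changed block could look like (only the updateStrategy part of your spec; everything else stays as it is):

spec:
  updateStrategy:
    changeBudget:
      maxSurge: 1        # unchanged from your current spec
      maxUnavailable: 1  # was 0; must be at least 1 so the operator may take one Pod down at a time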

I am running a similar setup, but I am using the default update strategy, which has maxUnavailable: 1.

  • ES version is at 7.8.0 and not changing
  • Upgrading elastic-cloud-operator from 1.1.0 to 1.2.1

Besides those differences, all my symptoms are the same.

edit:
Added the logs for the reconciliation cycle.

[
  {
    "log.level": "debug",
    "@timestamp": "2020-10-22T03:20:26.264Z",
    "log.logger": "observer",
    "message": "Retrieving cluster state",
    "service.version": "1.2.1-b5316231",
    "service.type": "eck",
    "ecs.version": "1.4.0",
    "es_name": "elky",
    "namespace": "elastic-cluster"
  },
  {
    "log.level": "info",
    "@timestamp": "2020-10-22T03:20:28.749Z",
    "log.logger": "elasticsearch-controller",
    "message": "Starting reconciliation run",
    "service.version": "1.2.1-b5316231",
    "service.type": "eck",
    "ecs.version": "1.4.0",
    "iteration": 147,
    "namespace": "elastic-cluster",
    "es_name": "elky"
  },
  {
    "log.level": "debug",
    "@timestamp": "2020-10-22T03:20:28.749Z",
    "log.logger": "es-validation",
    "message": "validate create",
    "service.version": "1.2.1-b5316231",
    "service.type": "eck",
    "ecs.version": "1.4.0",
    "name": "elky"
  },
  {
    "log.level": "info",
    "@timestamp": "2020-10-22T03:20:29.638Z",
    "log.logger": "zen2",
    "message": "Ensuring no voting exclusions are set",
    "service.version": "1.2.1-b5316231",
    "service.type": "eck",
    "ecs.version": "1.4.0",
    "namespace": "elastic-cluster",
    "es_name": "elky"
  },
  {
    "log.level": "info",
    "@timestamp": "2020-10-22T03:20:29.790Z",
    "log.logger": "migrate-data",
    "message": "Setting routing allocation excludes",
    "service.version": "1.2.1-b5316231",
    "service.type": "eck",
    "ecs.version": "1.4.0",
    "namespace": "elastic-cluster",
    "es_name": "elky",
    "value": "none_excluded"
  },
  {
    "log.level": "debug",
    "@timestamp": "2020-10-22T03:20:29.990Z",
    "log.logger": "driver",
    "message": "Applying predicates",
    "service.version": "1.2.1-b5316231",
    "service.type": "eck",
    "ecs.version": "1.4.0",
    "maxUnavailableReached": false,
    "allowedDeletions": 1
  },
  {
    "log.level": "info",
    "@timestamp": "2020-10-22T03:20:29.993Z",
    "log.logger": "driver",
    "message": "Cannot restart some nodes for upgrade at this time",
    "service.version": "1.2.1-b5316231",
    "service.type": "eck",
    "ecs.version": "1.4.0",
    "namespace": "elastic-cluster",
    "es_name": "elky",
    "failed_predicates": {
      "only_restart_healthy_node_if_green_or_yellow": [
        ...
        "elky-es-logging-2",
        "elky-es-logging-1",
        "elky-es-logging-0",
        "elky-es-master-2",
        "elky-es-master-1",
        "elky-es-master-0"
      ]
    }
  },
  {
    "log.level": "debug",
    "@timestamp": "2020-10-22T03:20:29.993Z",
    "log.logger": "driver",
    "message": "No pod deleted during rolling upgrade",
    "service.version": "1.2.1-b5316231",
    "service.type": "eck",
    "ecs.version": "1.4.0",
    "es_name": "elky",
    "namespace": "elastic-cluster"
  },
  {
    "log.level": "debug",
    "@timestamp": "2020-10-22T03:20:30.020Z",
    "log.logger": "driver",
    "message": "Statefulset not reconciled",
    "service.version": "1.2.1-b5316231",
    "service.type": "eck",
    "ecs.version": "1.4.0",
    "reason": "pod not upgraded"
  },
  {
    "log.level": "info",
    "@timestamp": "2020-10-22T03:20:30.021Z",
    "log.logger": "elasticsearch-controller",
    "message": "Ending reconciliation run",
    "service.version": "1.2.1-b5316231",
    "service.type": "eck",
    "ecs.version": "1.4.0",
    "iteration": 147,
    "namespace": "elastic-cluster",
    "es_name": "elky",
    "took": 1.271987487
  }
]
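
One thing I notice in my logs is that the failed predicate is only_restart_healthy_node_if_green_or_yellow rather than the MaxUnavailable one, which suggests the operator does not see the cluster as green or yellow at that moment. Below is a minimal sketch of how one could check the health the operator sees; it assumes the default ECK naming conventions for my cluster (elky-es-http service and elky-es-elastic-user secret in the elastic-cluster namespace) and a throwaway curl Pod, so adjust names for your setup:

# Fetch the elastic user's password that ECK stores in the <cluster>-es-elastic-user secret.
PASSWORD=$(kubectl get secret elky-es-elastic-user -n elastic-cluster \
  -o go-template='{{.data.elastic | base64decode}}')

# Query cluster health through the elky-es-http service from inside the cluster.
kubectl run -n elastic-cluster -it --rm health-check --image=curlimages/curl --restart=Never -- \
  curl -sk -u "elastic:$PASSWORD" "https://elky-es-http:9200/_cluster/health?pretty"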