Elasticsearch cluster pods are not restarting for version upgrade

  • Upgrading from ES 7.6.1 to 7.9.1
  • Upgrading elastic-cloud-operator from 1.1.2 to 1.2.1
  • Cluster status is green with no issues

For some reason, the operator is stuck during the upgrade and keeps reporting the failed predicate "do_not_restart_healthy_node_if_MaxUnavailable_reached":
{"log.level":"info","@timestamp":"2020-09-10T01:04:04.001Z","log.logger":"zen2","message":"Ensuring no voting exclusions are set","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","namespace":"elastic-system","es_name":"elasticsearch"}
{"log.level":"info","@timestamp":"2020-09-10T01:04:04.016Z","log.logger":"migrate-data","message":"Setting routing allocation excludes","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","namespace":"elastic-system","es_name":"elasticsearch","value":"none_excluded"}
{"log.level":"info","@timestamp":"2020-09-10T01:04:04.032Z","log.logger":"driver","message":"Cannot restart some nodes for upgrade at this time","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","namespace":"elastic-system","es_name":"elasticsearch","failed_predicates":{"do_not_restart_healthy_node_if_MaxUnavailable_reached":["elasticsearch-es-data-5","elasticsearch-es-data-4","elasticsearch-es-data-3","elasticsearch-es-data-2","elasticsearch-es-masters-2","elasticsearch-es-masters-1","elasticsearch-es-masters-0"]}}

Our Elasticsearch config (short version):

apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  annotations:
    common.k8s.elastic.co/controller-version: 1.2.1
    meta.helm.sh/release-name: elasticsearch
    meta.helm.sh/release-namespace: elastic-system
  labels:
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/version: 7.9.1
    common.k8s.elastic.co/type: elasticsearch
    elasticsearch.k8s.elastic.co/cluster-name: elasticsearch
    elasticsearch.k8s.elastic.co/statefulset-name: elasticsearch-es-data
  name: elasticsearch
  namespace: elastic-system
spec:
  nodeSets:
  - config:
    count: 3
    name: masters
  - config:
    count: 6
    name: data
  updateStrategy:
    changeBudget:
      maxSurge: 1
      maxUnavailable: 0
  version: 7.9.1
status:
  availableNodes: 9
  health: green
  phase: ApplyingChanges

elastic-cloud-operator StatefulSet (short version):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  annotations:
    meta.helm.sh/release-name: elastic-cloud-operator
    meta.helm.sh/release-namespace: elastic-system
  labels:
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/version: 1.2.1
    control-plane: elastic-cloud-operator
  name: elastic-cloud-operator
  namespace: elastic-system
spec:
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      control-plane: elastic-cloud-operator
  serviceName: elastic-cloud-operator
  template:
    spec:
      containers:
      - args:
        - manager
        - --enable-webhook
        - --log-verbosity=0
        - --metrics-port=9090
        env:
        - name: OPERATOR_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: OPERATOR_IMAGE
          value: docker.elastic.co/eck/eck-operator:1.2.1
        - name: WEBHOOK_SECRET
          value: elastic-cloud-operator-webhook-server-cert
        image: docker.elastic.co/eck/eck-operator:1.2.1
        imagePullPolicy: IfNotPresent
  updateStrategy:
    rollingUpdate:
      partition: 0
    type: RollingUpdate
status:
  collisionCount: 0
  currentReplicas: 1
  currentRevision: elastic-cloud-operator-75ccb4b8bb
  observedGeneration: 5
  readyReplicas: 1
  replicas: 1
  updateRevision: elastic-cloud-operator-75ccb4b8bb
  updatedReplicas: 1

I think the problem is that you are running with maxUnavailable: 0, which leaves no room for the operator to take a node down in order to upgrade it. That matches the failed predicate in your log, do_not_restart_healthy_node_if_MaxUnavailable_reached. A rolling upgrade needs to take down at least one Pod at a time, so you have to set maxUnavailable to at least 1 for the upgrade to go through. See also https://www.elastic.co/guide/en/cloud-on-k8s/1.2/k8s-update-strategy.html
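
As a minimal sketch, the updateStrategy section of the Elasticsearch resource from the question could be changed as follows. Only maxUnavailable differs from the posted manifest; the rest of the spec is assumed to stay as it is.

```yaml
spec:
  updateStrategy:
    changeBudget:
      maxSurge: 1
      # Let the operator take down one Pod at a time for the rolling upgrade.
      maxUnavailable: 1
```

Since the manifest is labelled as managed by Helm, the change probably belongs in the chart values rather than in a direct edit of the resource, otherwise the next Helm release may revert it.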