Upgrade of custom-image StatefulSet causes cluster-wide "Readiness probe failed"

Hello,

I'm using ECK Operator 1.0.0-beta1 running on Rancher 2.0.

I have a custom Elasticsearch image that mounts an off-cluster NFS share for snapshot backups. This works correctly, but when I upgrade the cluster (for example from 7.4.0 to 7.4.1) I see the following behavior:

  • Kubernetes tries to remove the last node in the cluster
  • This seems to time out, which results in the pod being killed (I think)
  • Then the entire cluster detects "Readiness Probe Failed" and falls over
  • The cluster comes back on its own, and the killed node now has the new version
  • Repeat for every node in the cluster

No data is lost during this, but the cluster restarts once for every pod.

The Dockerfile looks like this:

FROM docker.elastic.co/elasticsearch/elasticsearch:7.4.1

RUN yum -y install nfs-utils
RUN mkdir /mnt/snapshots
COPY ./my-start.sh /usr/local/bin/my-start.sh
ENTRYPOINT ["/usr/local/bin/my-start.sh"]

The my-start.sh script runs a mount command before invoking the original entrypoint:

#!/bin/bash
mount -vvv -t nfs -o nolock my-store:/volume/snapshots /mnt/snapshots
/usr/local/bin/docker-entrypoint.sh
umount /mnt/snapshots

I think the problem may be that the umount line is never reached because of the SIGTERM, but I don't know how to confirm this.
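One crude way to check (untested) might be to write a marker file to the persistent data volume right before the umount and see whether it ever shows up after an upgrade. A sketch, assuming the default data path:

#!/bin/bash
mount -vvv -t nfs -o nolock my-store:/volume/snapshots /mnt/snapshots
/usr/local/bin/docker-entrypoint.sh
# If this line is ever reached, the marker survives on the data volume
# and can be checked after the pod restarts.
date > /usr/share/elasticsearch/data/umount-was-reached
umount /mnt/snapshots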

Why would a shutdown timeout of a single instance cause the entire cluster to flap?

I made some progress here using the following:

#!/bin/bash
mount -vvv -t nfs -o nolock my-store:/volume/snapshots /mnt/snapshots
source /usr/local/bin/docker-entrypoint.sh
umount /mnt/snapshots

By sourcing the entrypoint (instead of calling it), I think the SIGTERM results in the umount line being reached.
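Another variation that might also work (untested) is to run the entrypoint in the background and forward the signal explicitly, so the unmount only happens after Elasticsearch has fully exited. A rough sketch using the same paths as above:

#!/bin/bash
mount -vvv -t nfs -o nolock my-store:/volume/snapshots /mnt/snapshots

# Run the stock entrypoint in the background so this wrapper (PID 1 in the
# container) stays in control of signal handling.
/usr/local/bin/docker-entrypoint.sh &
es_pid=$!

# Forward SIGTERM/SIGINT to Elasticsearch instead of dying with it.
trap 'kill -TERM "$es_pid" 2>/dev/null' TERM INT

# The first wait is interrupted by the trapped signal; the second one waits
# for Elasticsearch to actually finish shutting down before we unmount.
wait "$es_pid"
wait "$es_pid" 2>/dev/null

umount /mnt/snapshots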

Adding a lifecycle preStop to the container template also seems to prevent the pod from hanging.

lifecycle:
  preStop:
    exec:
      command: ["/usr/bin/umount", "/mnt/snapshots" ]

Is one of these approaches better?

Hey @Zorlack,

You could maybe get rid of the custom Docker image by:

  • adding an init container that does the mount
  • using your preStop hook to do the umount

This way you don't have to deal with building your own image and keeping it up to date. A rough sketch of what I mean is below.
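Very rough, untested sketch of the podTemplate shape I have in mind (it reuses your NFS server and share names; note that for a mount made in an init container to be visible to the Elasticsearch container you most likely need a shared volume with mountPropagation and a privileged init container, so that part needs verifying):

podTemplate:
  spec:
    volumes:
      - name: snapshots
        emptyDir: {}
    initContainers:
      - name: mount-snapshots
        # Any image that can provide mount.nfs works; reusing the stock
        # Elasticsearch image and installing nfs-utils is just one option.
        image: docker.elastic.co/elasticsearch/elasticsearch:7.4.1
        securityContext:
          privileged: true  # required for Bidirectional mount propagation
        volumeMounts:
          - name: snapshots
            mountPath: /mnt/snapshots
            mountPropagation: Bidirectional
        command:
          - "sh"
          - "-c"
          - >
            yum -y install nfs-utils &&
            mount -t nfs -o nolock my-store:/volume/snapshots /mnt/snapshots
    containers:
      - name: elasticsearch
        securityContext:
          capabilities:
            add:
              - SYS_ADMIN  # the umount in preStop still needs this
        volumeMounts:
          - name: snapshots
            mountPath: /mnt/snapshots
            mountPropagation: HostToContainer
        lifecycle:
          preStop:
            exec:
              command: ["/usr/bin/umount", "/mnt/snapshots"]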

Why would a shutdown timeout of a single instance cause the entire cluster to flap?

This is not expected. I'd like to understand it better.

  • Can you share your Elasticsearch yaml manifest?
  • What do you mean by the entire cluster detecting "Readiness Probe Failed" and falling over? Do all Pods become non-ready, so the service cannot route to the cluster?
  • Can you share some logs of the operator and Elasticsearch while this happens?

Hello @sebgl

I've been able to recreate this issue using a custom image, but the problem goes away when I use lifecycle exec commands.

To demonstrate the behavior I've made a short video which starts when I apply a change from 7.4.0 to 7.4.1: https://youtu.be/4icmwoyN8uY

(I have operator log files if you're interested in chasing this behavior down - but I think it comes down to my image not exiting cleanly.)

I have eliminated this behavior by moving my logic to lifecycle exec commands like so:

cat <<EOF | kubectl -n test apply -f -
apiVersion: elasticsearch.k8s.elastic.co/v1beta1
kind: Elasticsearch
metadata:
  name: quickstart
spec:
  version: 7.4.0
  nodeSets:
  - name: default
    count: 3
    config:
      node.master: true
      node.data: true
      node.ingest: true
      path.repo: [ "/var/local" ]
      xpack.security.authc.realms:
        native:
          native1:
            order: 1
    podTemplate:
      spec:
        containers:
          - name: elasticsearch
            resources:
              limits:
                memory: 2G
                cpu: 2
            env:
            - name: ES_JAVA_OPTS
              value: "-Xms1g -Xmx1g"
            securityContext:
              capabilities:
                add:
                  - SYS_ADMIN
            # Important: You must mount to a path which already exists in the image, because postStart executes too late to create the mount point.  
            # I used /var/local because it was empty and seemed reasonable.
            lifecycle:
              postStart:
                exec:                      
                  command:
                    - "sh"
                    - "-c"
                    - >
                      yum -y install nfs-utils &&
                      mount -vvv -t nfs -o nolock nfs-server:/volume1/search-quickstart /var/local
              preStop:
                exec:
                  command: ["/usr/bin/umount", "/var/local" ]
    volumeClaimTemplates:
      - metadata:
          name: elasticsearch-data
        spec:
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 5Gi
          storageClassName: local-path
EOF

The one problem with this solution is that I now have to install a bunch of packages every time a pod starts. But overall it's still cleaner than maintaining a custom image.

Many thanks!