I'm using ECK Operator 1.0.0-beta1 running on Rancher 2.0.
I have a custom image for Elasticsearch which adds an off-cluster NFS share for snapshot backups. This capability works correctly, but when I go to upgrade the cluster (such as from 7.4.0 to 7.4.1) I see the following behavior:
- Kubernetes tries to remove the last node in the cluster
- This seems to time out, which results in the pod being killed (I think)
- Then the entire cluster reports "Readiness probe failed" and falls over
- The cluster comes back on its own, and the killed node is now running the new version
- Repeat for every node in the cluster
No data is lost during this, but the cluster restarts once for every pod.
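If it helps, I assume the kill-after-timeout could be confirmed from the pod's events and grace period with something like this (the pod name below is just a placeholder, not my actual cluster name):

```shell
# Watch the events for the pod being upgraded; a force-kill after the
# grace period should show up here (placeholder pod name)
kubectl describe pod quickstart-es-default-2 | tail -n 25

# Check how long Kubernetes waits before force-killing the pod
kubectl get pod quickstart-es-default-2 \
  -o jsonpath='{.spec.terminationGracePeriodSeconds}'
```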
The Dockerfile looks like this:
```dockerfile
FROM docker.elastic.co/elasticsearch/elasticsearch:7.4.1
RUN yum -y install nfs-utils
RUN mkdir /mnt/snapshots
COPY ./my-start.sh /usr/local/bin/my-start.sh
ENTRYPOINT ["/usr/local/bin/my-start.sh"]
```
The my-start.sh script adds a mount command before calling the original entrypoint:
```bash
#!/bin/bash
mount -vvv -t nfs -o nolock my-store:/volume/snapshots /mnt/snapshots
/usr/local/bin/docker-entrypoint.sh
umount /mnt/snapshots
```

I think perhaps the problem is that the umount is never reached: since my-start.sh runs docker-entrypoint.sh in the foreground without exec, the script stays PID 1 and may never forward SIGTERM to Elasticsearch, so the pod hangs until it's force-killed. But I don't know how to confirm this.
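One way I thought of to test this would be to run the stock entrypoint in the background and trap the signal myself, so my script (as PID 1) can pass SIGTERM on to Elasticsearch and still reach the umount. This is just a sketch of my idea, using the names and paths from my setup above, and it assumes the stock docker-entrypoint.sh shuts Elasticsearch down cleanly on SIGTERM:

```bash
#!/bin/bash
# Hypothetical rewrite of my-start.sh: run the entrypoint in the
# background so this script can catch and forward SIGTERM.

cleanup() {
  kill -TERM "$es_pid" 2>/dev/null   # forward the shutdown signal to Elasticsearch
  wait "$es_pid"                     # let it stop cleanly before unmounting
  umount /mnt/snapshots
  exit 0
}

mount -vvv -t nfs -o nolock my-store:/volume/snapshots /mnt/snapshots

/usr/local/bin/docker-entrypoint.sh "$@" &
es_pid=$!

trap cleanup TERM INT
wait "$es_pid"                       # returns early if a trapped signal arrives

# Entrypoint exited on its own (no signal); clean up the mount anyway.
umount /mnt/snapshots
```

Even if that makes the umount run, though, I'm not sure it explains the cluster-wide flapping.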
Why would a shutdown timeout of a single instance cause the entire cluster to flap?