I am looking for general guidance on how to keep an Elasticsearch cluster on K8s happy while doing a K8s node update.
We are running an HA microservice-based SaaS application on Kubernetes, with installations on AWS and on Azure (using these providers' hosted K8s offerings, i.e., EKS and AKS, respectively). Elasticsearch is part of our stack.
We currently deploy the entire stack, including the ES cluster, with our own Helm chart, using our own ES images (based on the open-source edition of ES).
Note that we are aware of the ECK operator and evaluated it quite successfully (e.g., the way it handles an ES update while keeping the cluster green at all times is really nice), but due to legal concerns we could not actually use it. (This assessment by our legal guys is currently changing, so using the ECK operator MIGHT be an option in the future.)
All indexes are configured to have at least two shard replicas, and even in the smallest installations we have at least three ES nodes, so we can successfully tolerate the (unplanned) outage of at least one ES node at any given time. We usually deploy the SaaS across three availability zones, and to properly handle ES's "rack awareness", we create a dedicated StatefulSet for ES for each availability zone (pretty much the same as what the ECK operator does, BTW). Elasticsearch indexes are stored on PVCs created through the STS volumeClaimTemplate functionality. If an ES pod crashes, the existing volume will be mounted into the restarting pod and index data is preserved. With that setup, we can tolerate the outage of an entire AZ.
One challenge that we are facing is that of updating the Kubernetes(!) worker nodes while keeping the whole SaaS up and running. There are many challenges here, most of them rooted in our own software ;-), which carries a 20+ year history, but keeping the ES cluster in our stack happy during a K8s node update is also one of our challenges here.
To avoid confusion regarding the somewhat overused term "node", which is used both in the meaning of "machine managed by Kubernetes on which our pods are running" as well as "Elasticsearch instance", in the remainder of this text, I will always write "K8s node" to refer to the former, and "ES node" to refer to the latter.
We played around with a rolling update of the K8s worker nodes "in-place", i.e., having one node group in the K8s cluster, updating its launch config (i.e., choosing a new machine image with the updated OS), and then letting the cloud provider's K8s node update magic do the work.
Due to various restrictions, we concluded that we cannot go that route, so instead we will be using a blue/green approach, i.e., we will have two K8s node groups: the size of one group (e.g., "blue") is initially set to the required number of K8s nodes (with node autoscaling), while the size of the other group (e.g., "green") is initially set to 0.
When it is time to update the K8s nodes, we will change the machine image of the "green" group to the desired version, increase the size of the "green" group to the required amount, then cordon the K8s nodes of the blue group, then drain the blue K8s nodes one by one.
As a result, the pods will be evicted from the blue group and rescheduled on the new K8s nodes of the green group. Once all blue K8s nodes have been drained, we set that K8s node group's size to 0. For the next K8s node update, we will do the same but go from green to blue.
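The cordon-then-drain sequence described above could be scripted roughly as follows. This is only a minimal sketch in Python that builds the `kubectl` command lines without running them; the node names, the function name `drain_plan`, and the drain timeout are assumptions for illustration, and resizing the node groups is left to the cloud provider's tooling.

```python
from typing import List


def drain_plan(blue_nodes: List[str], timeout_seconds: int = 7200) -> List[List[str]]:
    """Build kubectl invocations: cordon ALL blue K8s nodes first, then drain them one by one."""
    # Cordon everything up front so evicted pods cannot land on another blue K8s node.
    plan = [["kubectl", "cordon", node] for node in blue_nodes]
    for node in blue_nodes:
        plan.append([
            "kubectl", "drain", node,
            "--ignore-daemonsets",      # DaemonSet pods cannot be evicted anyway
            "--delete-emptydir-data",   # allow eviction of pods that use emptyDir volumes
            f"--timeout={timeout_seconds}s",  # assumption: generous, to match a long grace period
        ])
    return plan


# Hypothetical node names, printed as shell commands for inspection.
for cmd in drain_plan(["blue-node-1", "blue-node-2"]):
    print(" ".join(cmd))
```

Each drain command would be run only after the previous one has returned, which is what gives the "one by one" behavior.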
The question is now: how do we keep our ES cluster happy and ideally in "green" state during this procedure?
We looked around a bit for guidance and best practices, but we couldn't find much (this might be due to the whole ES node vs. K8s node confusion... all findings generally relate to updating ES nodes, nothing seems to consider the updating of the K8s nodes inside which the ES nodes live...).
With our past positive experience with the ECK operator, we also played around with it to see what it does in the case of a K8s node update. Alas, it doesn't seem to do much related to this challenge at all (e.g., it does set PDBs, but that mainly caused the problem that it blocked the entire rolling K8s node update, because the drain wouldn't proceed as the PDBs were too restrictive).
So the first question is: is there a guide on how to set up and handle an ES cluster for a K8s(!) node update that we just failed to find?
Does the ECK operator in some way handle the K8s node update, and we just didn't use it properly?
We also did a bit of thinking on our own and came up with a few approaches:
We simply trigger the drain of a K8s node, leading to an eviction of the ES pod, i.e., the pod will go into terminating state, a SIGTERM is sent to ES, and if it doesn't exit by itself, after the terminationGracePeriod is up, it will get SIGKILLed.
This will make the cluster go yellow and shards will start to be re-allocated to the remaining ES nodes. In the meantime, the evicted ES pod will restart on a K8s node in the other K8s node group, and will eventually become ready and rejoin the cluster.
Again, shards will be reshuffled automatically by ES to balance the number of shards evenly amongst all ES nodes.
- it is simple!
- the cluster will turn yellow, so we have no remaining redundancy during the update
- the cluster has a good chance to break (or at least go red temporarily):
Consider the situation that there is quite a bit of data in our ES cluster, so re-allocating all shards will take some time. So even once the ES pod is restarted on the new K8s node, the cluster might remain in yellow state for quite some time.
However, from the pure K8s perspective, the new ES pod will be running and ready, so eventually the K8s node update will proceed, draining the next K8s node and evicting the next ES pod.
If this happens while the cluster is still yellow, we will inevitably(?) end up with a red cluster and a user-observable outage of our SaaS => FAIL
If things go really badly, we might even end up losing shards permanently => EPIC FAIL
We currently use a simple get request to /_cluster/state to determine the readiness of an ES pod. The idea is that we change that and only consider an ES pod ready if the entire ES cluster is in "green" state.
The benefit over the naive approach is that the fact that ES isn't completely happy (i.e., yellow instead of green) now becomes visible at the K8s level. And since the K8s node update will wait for all evicted pods to be running AND ready on their new K8s node, it will not proceed with draining the next K8s node until the ES cluster is green again. This should avoid the situation where the K8s node update proceeds too early and takes the ES cluster down.
- still simple
- the approach definitely feels a bit like abusing the readiness probe, and we have to compensate for this (e.g., a K8s service used by our ES clients to access ES has to be convinced to also send traffic to non-ready pods by setting publishNotReadyAddresses=true on the service)
- the cluster will still go yellow for possibly quite a long time, leaving us without redundancy.
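A "green only" readiness check along these lines might look like the following. This is a minimal sketch, assuming the probe runs inside the ES pod as an exec probe, that ES listens on `http://localhost:9200` without authentication, and that the check queries the `/_cluster/health` API (whose response carries a `status` field) rather than `/_cluster/state`. The function names are made up for illustration.

```python
import json
import urllib.request
from typing import Callable

ES_URL = "http://localhost:9200"  # assumption: local, unauthenticated HTTP endpoint


def is_ready(health: dict) -> bool:
    """The pod counts as ready only if the whole ES cluster reports green."""
    return health.get("status") == "green"


def fetch_health(base_url: str = ES_URL, timeout: float = 5.0) -> dict:
    """GET /_cluster/health and parse the JSON response."""
    with urllib.request.urlopen(f"{base_url}/_cluster/health", timeout=timeout) as resp:
        return json.load(resp)


def probe_exit_code(fetch: Callable[[], dict] = fetch_health) -> int:
    """Return 0 (ready) or 1 (not ready), as an exec probe expects."""
    try:
        return 0 if is_ready(fetch()) else 1
    except (OSError, ValueError):
        # ES unreachable or response not parseable: treat as not ready.
        return 1
```

An exec readiness probe could then end with `sys.exit(probe_exit_code())`. Note that with this check every ES pod turns non-ready at the same moment the cluster goes yellow, which is why the publishNotReadyAddresses workaround mentioned above becomes necessary.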
Instead of cheating with the readiness probe, we add a preStop hook to the ES pods that only allows the ES pod to terminate once the cluster state is green, i.e., it waits until the cluster is green. If we increase the terminationGracePeriod sufficiently (to the order of hours instead of the default 30 seconds), the result is that an ES pod that should be evicted will stay in terminating state without ES getting the SIGTERM until the cluster is green again. That way, we avoid the situation where ES pod evictions continue while the cluster is still yellow (thus taking it to red).
- no cheating with readiness probes
- the cluster will still go yellow
- the long terminationGracePeriod might not be respected everywhere... but based on our testing it IS respected, in particular during K8s node draining
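The waiting logic of such a preStop hook could be sketched as below. This is an illustration under assumptions, not a tested hook: it assumes a Python interpreter is available in the ES image (a shell script with curl would do the same job), that ES is reachable at `http://localhost:9200`, and that the maximum wait is kept below the pod's terminationGracePeriod. All names are hypothetical.

```python
import json
import time
import urllib.request
from typing import Callable

ES_URL = "http://localhost:9200"  # assumption: local, unauthenticated HTTP endpoint


def cluster_status(base_url: str = ES_URL) -> str:
    """Return the cluster status ("green", "yellow" or "red") from /_cluster/health."""
    with urllib.request.urlopen(f"{base_url}/_cluster/health", timeout=10) as resp:
        return json.load(resp)["status"]


def wait_until_green(
    get_status: Callable[[], str] = cluster_status,
    poll_seconds: float = 10.0,
    max_wait_seconds: float = 3 * 3600,  # assumption: must stay below terminationGracePeriod
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Block until the cluster is green; return False if the deadline passes first."""
    deadline = clock() + max_wait_seconds
    while True:
        try:
            if get_status() == "green":
                return True
        except OSError:
            pass  # ES temporarily unreachable: keep polling
        if clock() >= deadline:
            return False
        sleep(poll_seconds)
```

The hook script would simply call `wait_until_green()` and exit; once it returns (or the grace period runs out), K8s sends the SIGTERM to ES.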
To keep the cluster from even going yellow during the K8s node update, the idea is to have a preStop hook that uses ES's cluster-level shard allocation filtering feature (see "Cluster-level shard allocation and routing settings" in the Elasticsearch Guide [7.16]) to make ES move all shards currently held by the pod that should be evicted to the remaining ES nodes.
After triggering this "draining" of the ES node, the preStop hook will wait for all shards to be reallocated. Since we have at least two more ES nodes, all shards will find a new home on one of these and the cluster will stay green.
Only when the ES node is devoid of any remaining shards will the preStop hook script exit, and only then will that ES node get the SIGTERM. Since we are terminating an already empty ES node, the cluster will stay green.
- in theory this would keep the cluster happy
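To make the idea concrete, the "drain the ES node" hook could be sketched like this. It is a rough sketch under several assumptions: ES is reachable at `http://localhost:9200` without authentication, the ES node name equals the pod name, the exclusion is set via `cluster.routing.allocation.exclude._name` as a persistent setting (overwriting any existing exclusion list), and remaining shards are counted via `/_cat/shards?format=json`. All function names are hypothetical.

```python
import json
import time
import urllib.request
from typing import Callable, List

ES_URL = "http://localhost:9200"  # assumption: local, unauthenticated HTTP endpoint


def exclusion_request(node_name: str) -> dict:
    """Settings body that tells ES to move all shards away from node_name."""
    return {"persistent": {"cluster.routing.allocation.exclude._name": node_name}}


def shards_on_node(cat_shards: List[dict], node_name: str) -> int:
    """Count shards still held by node_name (relocating rows read 'src -> ip id dst')."""
    return sum(1 for s in cat_shards if (s.get("node") or "").split(" ")[0] == node_name)


def put_json(url: str, body: dict) -> dict:
    req = urllib.request.Request(
        url, data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"}, method="PUT",
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def get_json(url: str):
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def drain_es_node(
    node_name: str,
    put: Callable[[str, dict], object] = put_json,
    get: Callable[[str], list] = get_json,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
    poll_seconds: float = 10.0,
    max_wait_seconds: float = 3 * 3600,  # assumption: below terminationGracePeriod
) -> bool:
    """Exclude node_name from allocation, then wait until it holds no shards."""
    deadline = clock() + max_wait_seconds
    put(f"{ES_URL}/_cluster/settings", exclusion_request(node_name))
    while True:
        if shards_on_node(get(f"{ES_URL}/_cat/shards?format=json"), node_name) == 0:
            return True
        if clock() >= deadline:
            return False  # shards could not be moved in time
        sleep(poll_seconds)
```

Two caveats with this sketch: since a StatefulSet pod keeps its name, the exclusion presumably has to be cleared again once the pod has restarted on the new K8s node, otherwise the returning ES node would not receive any shards. And because ES never places two copies of the same shard on one ES node, the loop can only finish if the remaining ES nodes can actually hold every copy, hence the deadline.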
Any thoughts or comments?