Best practices for ECK on EKS with multi-AZ nodegroups and EBS volumes during node upgrades

Hi all,

I'm running an Elasticsearch cluster on EKS with the ECK operator, and I'm trying to understand the best way to handle EKS nodegroup upgrades in a multi-AZ setup, especially with EBS-backed volumes involved.

Here’s a simplified example (I know a 2-node cluster isn’t recommended — this is just for illustration):
I have 2 Elasticsearch nodes, one in AZ A and one in AZ B. Each pod is scheduled in its respective AZ and uses an EBS volume in the same zone. The problem shows up during nodegroup upgrades: EKS recreates the worker nodes in AZs it picks itself (e.g., AZ B and AZ C), so the pod that was in AZ A can be left with no worker node in its zone. Since EBS volumes are AZ-bound, the PersistentVolume's node affinity prevents that pod from being scheduled anywhere else, it stays Pending, and the cluster ends up in yellow status due to the missing data node.
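
For context on the volume side: with the EBS CSI driver, each PersistentVolume carries a node-affinity rule for the zone it was provisioned in, so once a claim is bound the pod can only ever run in that AZ. This is roughly the kind of StorageClass I'm using (gp3 and the name ebs-gp3 are just examples from my side); note that WaitForFirstConsumer only delays provisioning until the first scheduling decision and doesn't help once the volume already exists:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-gp3                             # example name
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
volumeBindingMode: WaitForFirstConsumer     # volume is created in the AZ where the pod is first scheduled
reclaimPolicy: Retain                       # keep the volume if the claim is deleted
allowVolumeExpansion: true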

AWS support confirmed that there’s no way to control the AZ placement during nodegroup upgrades — it’s random.

My idea is to create three separate nodegroups (nodegroup-az-a, nodegroup-az-b, nodegroup-az-c), each pinned to a specific AZ. This way, when a nodegroup is upgraded, the new nodes are always recreated in the same AZ. Then, I would define multiple nodeSets in the Elasticsearch manifest, each using a nodeSelector to target the corresponding AZ/nodegroup. This should ensure pods stay in the correct zone and can always access their EBS volumes.
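 
If it helps, this is the shape of the nodegroup layout I'm thinking of, written as an eksctl config (assuming eksctl is used to manage the cluster; the cluster name, instance type and region are placeholders). Each managed nodegroup is restricted to a single AZ, so an upgrade always brings replacement nodes back in the same zone:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster                        # placeholder
  region: eu-west-1
managedNodeGroups:
  - name: nodegroup-az-a
    instanceType: m6i.xlarge              # placeholder
    desiredCapacity: 1
    availabilityZones: ["eu-west-1a"]     # pin this nodegroup to a single AZ
  - name: nodegroup-az-b
    instanceType: m6i.xlarge
    desiredCapacity: 1
    availabilityZones: ["eu-west-1b"]
  - name: nodegroup-az-c
    instanceType: m6i.xlarge
    desiredCapacity: 1
    availabilityZones: ["eu-west-1c"]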

Here’s an example manifest I’m considering:

apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: elasticsearch-cluster
spec:
  version: 8.12.0
  nodeSets:
    - name: az-a
      count: 1
      config:
        node.roles: ["data", "ingest", "master"]
      podTemplate:
        spec:
          nodeSelector:
            topology.kubernetes.io/zone: eu-west-1a
    - name: az-b
      count: 1
      config:
        node.roles: ["data", "ingest", "master"]
      podTemplate:
        spec:
          nodeSelector:
            topology.kubernetes.io/zone: eu-west-1b
    - name: az-c
      count: 1
      config:
        node.roles: ["data", "ingest", "master"]
      podTemplate:
        spec:
          nodeSelector:
            topology.kubernetes.io/zone: eu-west-1c
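
On top of that, each nodeSet would get a volumeClaimTemplate pointing at the zone-aware EBS StorageClass. Something like this under every nodeSet (elasticsearch-data is the claim name ECK expects for the data path; the storage class name and size are placeholders from my setup):

      volumeClaimTemplates:
        - metadata:
            name: elasticsearch-data      # ECK's data volume claim name
          spec:
            accessModes: ["ReadWriteOnce"]
            resources:
              requests:
                storage: 100Gi            # placeholder size
            storageClassName: ebs-gp3     # placeholder StorageClass from above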

Has anyone implemented something similar? Are there any caveats or better approaches to ensure data node stability and volume availability during upgrades?

Thanks in advance for any insights!