Deploying ECK on AWS EKS with i3 and d2 instances and local volumes

I'm attempting to deploy Elasticsearch w/ ECK on EKS with i3 hot nodes and d2 warm nodes, with the EC2 instances created through Terraform as part of an autoscaling group so I can have failover across availability zones. Any solution that doesn't allow the loss of an AZ won't work for me. The best practices for this are unclear and I suspect I'm taking a wrong-headed approach.

The i3 series have local NVMe disks under /dev/nvme0n1, /dev/nvme1n1, etc.
The d2 series have local HDDs under /dev/sdb, /dev/sdc, etc.
Neither is mounted by default or made available to Kubernetes. That is more of an EKS shortcoming than an Elasticsearch one (I have an open support request with AWS about it), but it seems like something others must have solved, especially for the recommended aws.data.highio.i3 and aws.data.highstorage.d2.

My potentially wrong-headed first attempt was a userdata script that used mdadm to create a raid0 array of the disks, mounted under /mnt/data, with a plan to specify the local.path for the volume, but as far as I can tell this can only be set in a PV, not in the PVC for a StatefulSet. Creating a PV for each manually feels dirty and against the general design of ECK. Updating the count for a nodeSet should be sufficient, without also creating a PV (I'd automate this with Terraform).

My likely-terrible second attempt also used raid0, but mounted the array at /var/lib/docker/overlay2 and then ran service docker restart, with the idea of switching to ephemeral storage and emptyDir only. Something else is needed, though, before k8s updates the available ephemeral storage.

I see Local Persistent Volumes as a potential solution, through sig-storage-local-static-provisioner, possibly with eks-nvme-ssd-provisioner for the i3 instances, though I'm not sure about an equivalent for the d2 instances.

I can provide the terraform config, userdata scripts, etc if needed though they might be more of a distraction if this is a wrong-headed approach.

Hi,

I can't speak to the raid0 approach, but I did some tests on c5d instances with the following manifest. It uses the local storage provisioner and a small bash script to create the required symlinks.

I think it can be customized for the i3/d2 instances.

The local storage provisioner approach is working for me after a bit of experimentation.

userdata script to initialize local volumes
My current userdata script handles i3, i3en, and d2 instance types with one or more local volumes. It uses the nvme tool for NVMe devices or lsblk for others, and checks the block device mapping in the instance metadata to see which devices are ephemeral. A single device is mounted at /mnt/data directly; multiple devices are combined into a raid0 array at /dev/md0, which is then mounted at /mnt/data.

#!/bin/bash
set -o xtrace
echo '*    - nofile 65536' >> /etc/security/limits.conf
echo 'root - nofile 65536' >> /etc/security/limits.conf
echo "session required pam_limits.so" >> /etc/pam.d/common-session
echo 'vm.max_map_count=262144' >> /etc/sysctl.conf


# Identify the ephemeral volumes using either the nvme command for i3 disks or lsblk and the AWS API to query block device mappings
# https://aws.amazon.com/premiumsupport/knowledge-center/ec2-linux-instance-store-volumes/

if [[ -e /dev/nvme0n1 ]]; then
  yum install nvme-cli -y
  instance_stores=$(nvme list | awk '/Instance Storage/ {print $1}')
  echo $instance_stores
else
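  # Whole-disk device names as the OS sees them (sdX normalized to xvdX)
  OSDEVICE=$(sudo lsblk -o NAME -n | grep -v '[[:digit:]]' | sed "s/^sd/xvd/g")
  BDMURL="http://169.254.169.254/latest/meta-data/block-device-mapping/"

  instance_stores=$(
  # Only look at the ephemeralN mappings so the root and EBS volumes are never formatted
  for bd in $(curl -s $BDMURL | grep '^ephemeral'); do
    MAPDEVICE=$(curl -s $BDMURL/$bd/ | sed "s/^sd/xvd/g")
    if grep -wq "$MAPDEVICE" <<< "$OSDEVICE"; then
      # Emit the full path: mkfs.ext4 and mdadm need /dev/xvdX, not the bare name
      echo "/dev/$MAPDEVICE"
    fi
  done
  )
  echo $instance_stores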
fi

# If one volume is found, mount it at /mnt/data
# If multiple, create a raid0 array as /dev/md0 and mount it as /mnt/data
# A local-storage-provisioner using /mnt as the hostPath will pick up either of these
if [[ -n "$instance_stores" ]]; then
  count=$(echo $instance_stores | wc -w)
  if [[ $count -eq 1 ]]; then
    mkdir -p /mnt/data
    mkfs.ext4 $instance_stores
    echo $instance_stores /mnt/data ext4 defaults,noatime 0 2 >> /etc/fstab
    mount -a
  elif [[ $count -gt 1 ]]; then
    yum install mdadm -y
    mkdir -p /mnt/data
    mdadm --create --verbose --level=0 /dev/md0 --name=DATA --raid-devices=$count $instance_stores
    mdadm --wait /dev/md0
    mkfs.ext4 /dev/md0
    mdadm --detail --scan >> /etc/mdadm.conf
    echo /dev/md0 /mnt/data ext4 defaults,noatime 0 2 >> /etc/fstab
    mount -a
  fi
fi

# Join the EKS cluster (the ${var.*} values are filled in by Terraform)
/etc/eks/bootstrap.sh --apiserver-endpoint '${var.eks_endpoint}' --b64-cluster-ca '${var.eks_ca_data}' --kubelet-extra-args '--node-labels=${var.node_labels}' '${var.cluster_name}'
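
Since the provisioner only picks up directories that are real mount points, it can help to confirm the mount once the userdata has run. The following is only a sketch of such a check, not part of the script above; the is_mounted helper name and its optional second argument are made up for illustration.

```shell
# Sketch: check that a path is a mount point by looking it up in a mounts table.
# Reads /proc/mounts by default; the optional second argument (hypothetical)
# exists only so the function can be tried against a sample file.
is_mounted() {
  awk -v p="$1" '$2 == p { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}

if is_mounted /mnt/data; then
  echo "/mnt/data is mounted"
else
  echo "/mnt/data is NOT mounted, so the provisioner will not discover it" >&2
fi
```

On a node with multiple instance store devices, cat /proc/mdstat should additionally show the md0 array as active.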

Node labels

  • localVolume=present
  • storage={high,fast}
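
For anyone wiring this up, the kubelet --node-labels flag takes a single comma-separated list of key=value pairs, which is what var.node_labels in the bootstrap line has to contain. A small sketch of building that string follows; the join_labels helper is hypothetical, and localVolume=present is used because it is the label the provisioner's nodeSelector matches on.

```shell
# Sketch: join key=value labels into the comma-separated form expected by
# the kubelet --node-labels flag. join_labels is a hypothetical helper.
join_labels() (
  IFS=,
  echo "$*"   # "$*" joins the arguments with the first character of IFS
)

# Labels for a hot-tier worker
echo "--node-labels=$(join_labels localVolume=present storage=fast)"
# Labels for a warm-tier worker
echo "--node-labels=$(join_labels localVolume=present storage=high)"
```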

Deploy local-storage-provisioner
I cloned the helm chart from https://github.com/kubernetes-sigs/sig-storage-local-static-provisioner/tree/master/helm, pushed it to my local repo, then used these values for the chart.

common:
  rbac:
    pspEnabled: true
serviceMonitor:
  enabled: true
classes:
  - name: local-storage
    hostDir: "/mnt"
    mountDir: "/mnt"
    storageClass: false
daemonset:
  nodeSelector:
    localVolume: present

  • hostDir=/mnt, not /mnt/data - the local volume provisioner looks for directories under the specified path and expects each child directory to be a mount point. Using /mnt/data reports an error because /mnt/data/lost+found exists under it, and /mnt/data itself is still not found as a local volume.
  • nodeSelector applies this provisioner to nodes with the label localVolume=present
  • storageClass=false because I had already defined the storage class - others might want to let the chart create it for them
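
For anyone defining the storage class by hand, a minimal sketch is below. Only the name local-storage comes from this thread. The no-provisioner and WaitForFirstConsumer settings are what the Kubernetes documentation describes for local volumes, and the original poster's actual definition may differ. The local_storage_class function is a hypothetical wrapper so the manifest can be piped to kubectl.

```shell
# Sketch: print a StorageClass for statically provisioned local volumes.
# The class name defaults to "local-storage" (the name used in the chart values);
# the rest is an assumption, not the original poster's definition.
local_storage_class() {
  cat <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ${1:-local-storage}
# PVs are created by the static provisioner, not provisioned dynamically
provisioner: kubernetes.io/no-provisioner
# Delay binding until a pod is scheduled, so the PV is picked on the pod's node
volumeBindingMode: WaitForFirstConsumer
EOF
}

local_storage_class
```

With kubectl configured for the cluster, this could be applied with: local_storage_class | kubectl apply -f -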

Autoscaling group vs explicit instances vs managed node groups
I prefer to use autoscaling groups that span multiple zones instead of creating ec2 instances explicitly. The cluster-autoscaler isn't aware of the local volume provisioner yet, though there are issues raised to request it. This means that it will never add new workers based on a new elasticsearch node being configured in a nodeSet, so min/max/desired would need to be set manually.
I'd like to use the Managed Node Groups, but these don't support custom userdata scripts or custom security groups yet, so I'm holding off on them for now. Could always use a privileged init-container instead of userdata, though it feels messy.
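
Because the cluster-autoscaler will not grow the group in this setup, the group size has to track the nodeSet count by hand. In this thread that is done through Terraform; the sketch below shows the same idea with the AWS CLI instead. It only prints the command rather than running it, and the group name es-hot-workers is hypothetical.

```shell
# Sketch: print the AWS CLI call that pins an autoscaling group to a fixed size
# (min = max = desired). asg_resize_cmd and the group name are hypothetical.
asg_resize_cmd() {
  echo "aws autoscaling update-auto-scaling-group --auto-scaling-group-name $1 --min-size $2 --max-size $2 --desired-capacity $2"
}

# Match a nodeSet with count: 5
asg_resize_cmd es-hot-workers 5
```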

Elasticsearch nodeSets
The available size isn't exactly what AWS lists, so I checked the amount reported by the local-volume-provisioner and used the nearest Gi for the requested storage.

  • i3en.2xlarge reports 4960178446336 bytes, so I used 4619Gi
  • i3.2xlarge has 1870043070464 bytes, so I used 1740Gi
  • d2.2xlarge has 11906751668224 bytes, so I used 11089Gi
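
The conversion behind these numbers is just a division by 1024^3, rounded down so the claim never asks for more than the volume holds. A quick sketch, using a hypothetical bytes_to_gi helper:

```shell
# Sketch: convert a byte count (as reported for the local PV) to whole Gi,
# rounding down so the PVC request never exceeds the PV capacity.
bytes_to_gi() {
  echo $(( $1 / 1073741824 ))   # 1 Gi = 1024 * 1024 * 1024 bytes
}

bytes_to_gi 4960178446336    # i3en.2xlarge -> 4619
bytes_to_gi 1870043070464    # i3.2xlarge   -> 1741
bytes_to_gi 11906751668224   # d2.2xlarge   -> 11089
```

The 1740Gi used for i3.2xlarge is slightly below the computed 1741, which still works because a claim binds to any volume whose capacity is at least the requested size.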

All use storageClass: local-storage

I also have some daemonsets running, so the full instance cpu/mem is not available.
For my environment these allocations worked well (following the advice that half the memory should be heap, half left for os/cache/etc). Note that i3.2xlarge and d2.2xlarge have 61Gi instead of 64:

  • i3en.2xlarge: 7 cpu, 60Gi ram, 30g heap, 4619Gi storage
  • i3.2xlarge: 7 cpu, 55Gi ram, 27g heap, 1740Gi storage
  • d2.2xlarge: 7 cpu, 55Gi ram, 27g heap, 11089Gi storage
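
The heap values follow directly from the half-for-heap rule. As a sketch, with a hypothetical heap_gb helper: the cap of 31 is an added assumption, based on the commonly cited advice to keep the heap below roughly 32GB so the JVM keeps using compressed object pointers, and it does not affect either of the sizes used here.

```shell
# Sketch: derive the JVM heap (in GB) from the pod memory limit (in Gi) using
# the "half for heap, half for OS/cache" rule. The default cap of 31 is an
# assumption (stay under ~32GB for compressed object pointers).
heap_gb() (
  mem_gi=$1
  cap=${2:-31}
  heap=$(( mem_gi / 2 ))          # integer division: 55 / 2 gives 27
  if [ "$heap" -gt "$cap" ]; then
    heap=$cap
  fi
  echo "$heap"
)

echo "-Xms$(heap_gb 60)g -Xmx$(heap_gb 60)g"   # 60Gi limit -> -Xms30g -Xmx30g
echo "-Xms$(heap_gb 55)g -Xmx$(heap_gb 55)g"   # 55Gi limit -> -Xms27g -Xmx27g
```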

I went with the i3en.2xlarge to have more storage per instance in my hot tier, and to have the "up to 25 Gbit/s" speed instead of "up to 10". Note that this is the max burst speed, with a baseline that's much lower and the instance will be throttled harshly if it exceeds the baseline for too long. This nerfed a few of my nodes when doing the initial backup to s3, and also when replacing a nodeSet and migrating data.

My current nodeSets for hot and warm tier (I generate it from a local helm chart through terraform, so I'm posting the output from kubectl get elasticsearch -o yaml instead of my original):

spec:
  nodeSets:
  - name: hot2
    config:
      node.attr.data: hot
      node.data: true
      node.ingest: true
      node.master: true
    count: 5
    podTemplate:
      spec:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: storage
                  operator: In
                  values:
                  - fast
        containers:
        - env:
          - name: ES_JAVA_OPTS
            value: -Xms30g -Xmx30g -Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:filecount=2,filesize=10m
          name: elasticsearch
          resources:
            limits:
              cpu: 7
              ephemeral-storage: 400Mi
              memory: 60Gi
            requests:
              cpu: 7
              ephemeral-storage: 400Mi
              memory: 60Gi
        initContainers:
        - command:
          - sh
          - -c
          - |
            bin/elasticsearch-plugin install --batch mapper-size repository-s3
          name: install-plugins
        priorityClassName: elasticsearch
        serviceAccountName: alerts-es
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 4619Gi
        storageClassName: local-storage
  - name: warm
    config:
      node.attr.data: warm
      node.data: true
      node.ingest: false
      node.master: false
    count: 5
    podTemplate:
      spec:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: storage
                  operator: In
                  values:
                  - high
        containers:
        - env:
          - name: ES_JAVA_OPTS
            value: -Xms27g -Xmx27g -Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:filecount=2,filesize=10m
          name: elasticsearch
          resources:
            limits:
              cpu: 7
              ephemeral-storage: 400Mi
              memory: 55Gi
            requests:
              cpu: 7
              ephemeral-storage: 400Mi
              memory: 55Gi
        initContainers:
        - command:
          - sh
          - -c
          - |
            bin/elasticsearch-plugin install --batch mapper-size repository-s3
          name: install-plugins
        priorityClassName: elasticsearch
        serviceAccountName: alerts-es
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 11089Gi
        storageClassName: local-storage
  secureSettings:
  - secretName: alerts-es-backup
  version: 7.7.0

Things worth noting:

  • added a priorityClass for elasticsearch so other pods don't compete for space on the special nodes
  • The pod gets killed for using over 100Mi of ephemeral storage so I bumped it to 400Mi
  • I'm using secureSettings for the s3 access key instead of allowing all pods to hit that s3 bucket
  • haven't added the dedicated masters yet because I haven't found any documentation around their storage requirements or recommended sizing

My performance is significantly worse with the 5 i3en.2xlarge hot nodes and 5 d2.2xlarge warm nodes, even when querying something that should only be hitting the hot tier. The hot tier has 88 shards per instance, 30g heap, and 2 local NVMe drives behind mdadm raid0 w/ the default chunk size. The warm tier has 224 shards per instance, 27g heap, and 6 local HDDs behind mdadm raid0 w/ the default chunk size.

I had been using 6 m5.4xlarge with 4tb of EBS each, then moved to 6tb. That was much faster than the current setup, which suggests that I'm either hitting a cpu bottleneck (8 vCPUs per instance instead of 16), or my raid0 array needs some tuning if the default chunk size doesn't perform well for the nvme drives.
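
One way to test the chunk-size theory is to recreate the array with an explicit --chunk value (mdadm takes it in KiB) and benchmark a few settings. The sketch below only prints the command so nothing is destroyed by accident. The mdadm_create_cmd helper is hypothetical, the value 64 is an arbitrary example rather than a recommendation, and the device names are illustrative.

```shell
# Sketch: print (not run) an mdadm create command with an explicit chunk size
# so different values can be benchmarked. First argument is the chunk size in
# KiB; the remaining arguments are the member devices.
mdadm_create_cmd() (
  chunk_kib=$1
  shift
  echo "mdadm --create --verbose --level=0 /dev/md0 --name=DATA --chunk=${chunk_kib} --raid-devices=$# $*"
)

mdadm_create_cmd 64 /dev/nvme1n1 /dev/nvme2n1
```

Recreating the array wipes the data on it, so this is best tried on a fresh worker before Elasticsearch is scheduled there.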

For the i3en.2xlarge worker I could either run one elasticsearch instance with 60g memory reserved, 30g heap, and mdadm to stripe the two 2.5tb disks -- or I could mount both drives separately and create two elasticsearch instances per i3en.2xlarge instance.

Similar question for the d2.2xlarge instances. They have 8 cores, 61g ram, 6x2tb HDD, so either raid0 to stripe for a single large instance or six instances, each with ~10g ram and 5g heap, which seems small. I'd likely move to h1.4xlarge nodes in that case to have 16 cores, 64g ram, and 2x2tb drives.

Any thoughts on which approach would work best?