The local storage provisioner approach is working for me after a bit of experimentation.
Userdata script to initialize local volumes
My current userdata script handles i3, i3en, and d2 instance types with one or more local volumes. It identifies the ephemeral devices with the nvme tool on NVMe instances, or with lsblk plus the AWS block device mapping metadata to work out which devices are ephemeral on the others. A single device is formatted and mounted at /mnt/data directly; multiple devices are assembled into a RAID 0 array at /dev/md0, which is then mounted at /mnt/data.
#!/bin/bash
set -o xtrace
echo '*    - nofile 65536' >> /etc/security/limits.conf
echo 'root - nofile 65536' >> /etc/security/limits.conf
echo "session required pam_limits.so" >> /etc/pam.d/common-session
echo 'vm.max_map_count=262144' >> /etc/sysctl.conf
# Identify the ephemeral volumes using either the nvme command for i3 disks or lsblk and the AWS API to query block device mappings
# https://aws.amazon.com/premiumsupport/knowledge-center/ec2-linux-instance-store-volumes/
if [[ -e /dev/nvme0n1 ]]; then
  yum install nvme-cli -y
  instance_stores=$(nvme list | awk '/Instance Storage/ {print $1}')
  echo $instance_stores
else
  OSDEVICE=$(sudo lsblk -o NAME -n | grep -v '[[:digit:]]' | sed "s/^sd/xvd/g")
  BDMURL="http://169.254.169.254/latest/meta-data/block-device-mapping/"
  instance_stores=$(
  for bd in $(curl -s $BDMURL); do
    MAPDEVICE=$(curl -s $BDMURL/$bd/ | sed "s/^sd/xvd/g");
    if grep -wq $MAPDEVICE <<< "$OSDEVICE"; then
      echo "/dev/$MAPDEVICE"   # emit the full device path so mkfs/mdadm/fstab below work
    fi
  done
  )
  echo $instance_stores
fi
# If one volume is found, mount it at /mnt/data
# If multiple, create a raid0 array as /dev/md0 and mount it at /mnt/data
# A local-storage-provisioner using /mnt as the hostPath will pick up either of these
if [[ -n "$instance_stores" ]]; then
  count=$(echo $instance_stores | wc -w)
  if [[ $count -eq 1 ]]; then
    mkdir -p /mnt/data
    mkfs.ext4 $instance_stores
    echo $instance_stores /mnt/data ext4 defaults,noatime 0 2 >> /etc/fstab
    mount -a
  elif [[ $count -gt 1 ]]; then
    yum install mdadm -y
    mkdir -p /mnt/data
    mdadm --create --verbose --level=0 /dev/md0 --name=DATA --raid-devices=$count $instance_stores
    mdadm --wait /dev/md0
    mkfs.ext4 /dev/md0
    mdadm --detail --scan >> /etc/mdadm.conf
    echo /dev/md0 /mnt/data ext4 defaults,noatime 0 2 >> /etc/fstab
    mount -a
  fi
fi
/etc/eks/bootstrap.sh --apiserver-endpoint '${var.eks_endpoint}' --b64-cluster-ca '${var.eks_ca_data}' --kubelet-extra-args '--node-labels=${var.node_labels}' '${var.cluster_name}'
Node labels
- localVolume=present
 
- storage={high,fast}
 
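For reference, this is roughly how the labels end up on a hot-tier Node object (the node name here is made up) - they come from the --node-labels kubelet argument in the bootstrap line of the userdata above:
apiVersion: v1
kind: Node
metadata:
  name: ip-10-0-1-23.ec2.internal   # hypothetical node name
  labels:
    localVolume: present
    storage: fast                   # warm-tier (d2) nodes get storage=high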
Deploy local-storage-provisioner
I cloned the helm chart from https://github.com/kubernetes-sigs/sig-storage-local-static-provisioner/tree/master/helm, pushed it to my local repo, and used these values for the chart:
common:
  rbac:
    pspEnabled: true
serviceMonitor:
  enabled: true
classes:
  - name: local-storage
    hostDir: "/mnt"
    mountDir: "/mnt"
    storageClass: false
daemonset:
  nodeSelector:
    localVolume: present
- hostDir=/mnt, not /mnt/data - the local volume provisioner looks for directories under the specified path, and expects each child directory to be a mount point. Using /mnt/data will report an error since /mnt/data/lost+found exists under it, while also not finding /mnt/data as a local volume.
 
- nodeSelector applies this provisioner to nodes with label localVolume=present
 
- storageClass=false because I already defined the StorageClass myself (shown below) - others might want to let the chart create it for them
 
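The StorageClass I created separately is just the standard no-provisioner class for local volumes, roughly this:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer   # bind only once a pod is scheduled onto a node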
Autoscaling group vs explicit instances vs managed node groups
I prefer to use autoscaling groups that span multiple zones instead of creating ec2 instances explicitly. The cluster-autoscaler isn't aware of the local volume provisioner yet, though there are issues raised to request it. This means that it will never add new workers based on a new elasticsearch node being configured in a nodeSet, so min/max/desired would need to be set manually.
I'd like to use the Managed Node Groups, but these don't support custom userdata scripts or custom security groups yet, so I'm holding off on them for now. Could always use a privileged init-container instead of userdata, though it feels messy.
Elasticsearch nodeSets
The usable size isn't exactly what AWS lists, so I checked the capacity reported by the local-volume-provisioner and set the requested storage to a whole number of Gi just under it.
- i3en.2xlarge reports 4960178446336 bytes, so I used 4619Gi
 
- i3.2xlarge has 1870043070464 bytes, so I used 1740Gi
 
- d2.2xlarge has 11906751668224 bytes, so I used 11089Gi
 
All use storageClass: local-storage
I also have some daemonsets running, so the full instance cpu/mem is not available.
For my environment these allocations worked well (following the advice that half the memory should be heap, half left for os/cache/etc). Note that i3.2xlarge and d2.2xlarge have 61Gi instead of 64:
- i3en.2xlarge: 7 cpu, 60Gi ram, 30g heap, 4619Gi storage
 
- i3.2xlarge: 7 cpu, 55Gi ram, 27g heap, 1740Gi storage

- d2.2xlarge: 7 cpu, 55Gi ram, 27g heap, 11089Gi storage
 
I went with the i3en.2xlarge to get more storage per instance in my hot tier, and to get the "up to 25 Gbit/s" network instead of "up to 10". Note that those are burst speeds: the baseline is much lower, and the instance is throttled harshly if it exceeds the baseline for too long. This nerfed a few of my nodes during the initial backup to s3, and again when replacing a nodeSet and migrating data.
My current nodeSets for hot and warm tier (I generate it from a local helm chart through terraform, so I'm posting the output from kubectl get elasticsearch -o yaml instead of my original):
spec:
  nodeSets:
  - name: hot2
    config:
      node.attr.data: hot
      node.data: true
      node.ingest: true
      node.master: true
    count: 5
    podTemplate:
      spec:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: storage
                  operator: In
                  values:
                  - fast
        containers:
        - env:
          - name: ES_JAVA_OPTS
            value: -Xms30g -Xmx30g -Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:filecount=2,filesize=10m
          name: elasticsearch
          resources:
            limits:
              cpu: 7
              ephemeral-storage: 400Mi
              memory: 60Gi
            requests:
              cpu: 7
              ephemeral-storage: 400Mi
              memory: 60Gi
        initContainers:
        - command:
          - sh
          - -c
          - |
            bin/elasticsearch-plugin install --batch mapper-size repository-s3
          name: install-plugins
        priorityClassName: elasticsearch
        serviceAccountName: alerts-es
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 4619Gi
        storageClassName: local-storage
  - name: warm
    config:
      node.attr.data: warm
      node.data: true
      node.ingest: false
      node.master: false
    count: 5
    podTemplate:
      spec:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: storage
                  operator: In
                  values:
                  - high
        containers:
        - env:
          - name: ES_JAVA_OPTS
            value: -Xms27g -Xmx27g -Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:filecount=2,filesize=10m
          name: elasticsearch
          resources:
            limits:
              cpu: 7
              ephemeral-storage: 400Mi
              memory: 55Gi
            requests:
              cpu: 7
              ephemeral-storage: 400Mi
              memory: 55Gi
        initContainers:
        - command:
          - sh
          - -c
          - |
            bin/elasticsearch-plugin install --batch mapper-size repository-s3
          name: install-plugins
        priorityClassName: elasticsearch
        serviceAccountName: alerts-es
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 11089Gi
        storageClassName: local-storage
  secureSettings:
  - secretName: alerts-es-backup
  version: 7.7.0
Things worth noting:
- added a priorityClass for elasticsearch so other pods don't compete for space on the special nodes (sketch at the end of these notes)
 
- The pod was getting killed for using over 100Mi of ephemeral storage, so I bumped the limit to the 400Mi shown in the podTemplate above
 
- I'm using secureSettings for the s3 access key instead of allowing all pods to hit that s3 bucket (example at the end of these notes)
 
- haven't added the dedicated masters yet because I haven't found any documentation around their storage requirements or recommended sizing
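In case it helps, the priorityClass is nothing fancy - roughly this, where the value just needs to be higher than whatever your other workloads use:
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: elasticsearch
value: 1000000        # arbitrary, just higher than the default for other workloads
globalDefault: false
description: "Elasticsearch pods should not be displaced from the storage nodes"
And the secureSettings secret is a plain opaque secret whose keys match the keystore entries for the repository-s3 plugin (the key names below assume the default s3 client):
apiVersion: v1
kind: Secret
metadata:
  name: alerts-es-backup
type: Opaque
stringData:
  s3.client.default.access_key: "<access key>"
  s3.client.default.secret_key: "<secret key>"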