The local storage provisioner approach is working for me after a bit of experimentation.
Userdata script to initialize local volumes
My current userdata script handles i3, i3en, and d2 instance types with one or more local volumes. It uses the nvme tool for NVMe devices, or lsblk plus the AWS block-device-mapping metadata to work out which devices are ephemeral. A single device is formatted and mounted at /mnt/data directly; multiple devices are combined into a raid 0 array at /dev/md0, which is then mounted at /mnt/data.
#!/bin/bash
set -o xtrace
echo '* - nofile 65536' >> /etc/security/limits.conf
echo 'root - nofile 65536' >> /etc/security/limits.conf
echo "session required pam_limits.so" >> /etc/pam.d/common-session
echo 'vm.max_map_count=262144' >> /etc/sysctl.conf
# apply immediately - /etc/sysctl.conf was already processed earlier in this boot
sysctl -w vm.max_map_count=262144
# Identify the ephemeral volumes using either the nvme command for i3 disks or lsblk and the AWS API to query block device mappings
# https://aws.amazon.com/premiumsupport/knowledge-center/ec2-linux-instance-store-volumes/
if [[ -e /dev/nvme0n1 ]]; then
  yum install nvme-cli -y
  instance_stores=$(nvme list | awk '/Instance Storage/ {print $1}')
  echo $instance_stores
else
  OSDEVICE=$(sudo lsblk -o NAME -n | grep -v '[[:digit:]]' | sed "s/^sd/xvd/g")
  BDMURL="http://169.254.169.254/latest/meta-data/block-device-mapping/"
  instance_stores=$(
    for bd in $(curl -s $BDMURL); do
      MAPDEVICE=$(curl -s $BDMURL/$bd/ | sed "s/^sd/xvd/g")
      if grep -wq $MAPDEVICE <<< "$OSDEVICE"; then
        # mkfs and fstab need the full device path, not just the kernel name
        echo /dev/$MAPDEVICE
      fi
    done
  )
  echo $instance_stores
fi
# If one volume is found, mount it at /mnt/data
# If multiple, create a raid0 array as /dev/md0 and mount it at /mnt/data
# A local-storage-provisioner using /mnt as the hostPath will pick up either of these
if [[ -n "$instance_stores" ]]; then
  count=$(echo $instance_stores | wc -w)
  if [[ $count -eq 1 ]]; then
    mkdir -p /mnt/data
    mkfs.ext4 $instance_stores
    echo $instance_stores /mnt/data ext4 defaults,noatime 0 2 >> /etc/fstab
    mount -a
  elif [[ $count -gt 1 ]]; then
    yum install mdadm -y
    mkdir -p /mnt/data
    mdadm --create --verbose /dev/md0 --level=0 --name=DATA --raid-devices=$count $instance_stores
    mdadm --wait /dev/md0
    mkfs.ext4 /dev/md0
    mdadm --detail --scan >> /etc/mdadm.conf
    echo /dev/md0 /mnt/data ext4 defaults,noatime 0 2 >> /etc/fstab
    mount -a
  fi
fi
/etc/eks/bootstrap.sh --apiserver-endpoint '${var.eks_endpoint}' --b64-cluster-ca '${var.eks_ca_data}' --kubelet-extra-args '--node-labels=${var.node_labels}' '${var.cluster_name}'
Node labels
- localVolume=present
- storage={high,fast}
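For reference, these are what ${var.node_labels} in the bootstrap line above expands to; for a hot-tier i3en worker the rendered kubelet argument would be something like this (hypothetical rendering, the warm d2 workers get storage=high instead):

# illustrative only - the actual values come from the terraform var.node_labels
--kubelet-extra-args '--node-labels=localVolume=present,storage=fast'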
Deploy local-storage-provisioner
I cloned the helm chart from https://github.com/kubernetes-sigs/sig-storage-local-static-provisioner/tree/master/helm, pushed it to my local repo, then used these values for the chart:
common:
  rbac:
    pspEnabled: true
  serviceMonitor:
    enabled: true
classes:
- name: local-storage
  hostDir: "/mnt"
  mountDir: "/mnt"
  storageClass: false
daemonset:
  nodeSelector:
    localVolume: present
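With those values saved to a file (say provisioner-values.yaml - a name I'm making up here), installing the cloned chart from its helm/provisioner directory looks roughly like:

helm upgrade --install local-volume-provisioner ./helm/provisioner \
  --namespace kube-system \
  -f provisioner-values.yaml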
- hostDir is /mnt, not /mnt/data - the local volume provisioner looks for directories under the specified path and expects each child directory to be a mount point. Using /mnt/data would report an error because /mnt/data/lost+found exists under it, while also never picking up /mnt/data itself as a local volume.
- nodeSelector applies this provisioner to nodes with label localVolume=present
- storageClass=false because I already defined the StorageClass myself (a minimal example is sketched below) - others might want to let the chart create it for them
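Since storageClass is false, the class has to exist already. A minimal local StorageClass sketch (this is the standard no-provisioner definition, not necessarily byte-for-byte what I deployed) looks like:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer

WaitForFirstConsumer matters here so the claim only binds to a PV on the node the pod actually gets scheduled to.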
Autoscaling group vs explicit instances vs managed node groups
I prefer to use autoscaling groups that span multiple zones instead of creating EC2 instances explicitly. The cluster-autoscaler isn't aware of the local volume provisioner yet, though there are open issues requesting it, so it will never add new workers just because a new Elasticsearch node is configured in a nodeSet; min/max/desired have to be set manually.
I'd like to use Managed Node Groups, but they don't support custom userdata scripts or custom security groups yet, so I'm holding off on them for now. I could always use a privileged init container instead of userdata, though that feels messy.
Elasticsearch nodeSets
The available size isn't exactly what AWS lists, so I checked the amount reported by the local-volume-provisioner and used the nearest Gi for the requested storage.
- i3en.2xlarge reports 4960178446336 bytes, so I used 4619Gi
- i3.2xlarge has 1870043070464 bytes, so I used 1740Gi
- d2.2xlarge has 11906751668224 bytes, so I used 11089Gi
All use storageClass: local-storage
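For reference, the Gi figure is roughly the reported byte count divided by 1024^3, rounded down; e.g. for the i3en.2xlarge:

echo $((4960178446336 / 1024 / 1024 / 1024))   # prints 4619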
I also have some daemonsets running, so the full instance cpu/mem is not available.
For my environment these allocations worked well (following the advice to give half the memory to the heap and leave the other half for the OS and filesystem cache). Note that i3.2xlarge and d2.2xlarge have 61Gi instead of 64:
- i3en.2xlarge: 7 cpu, 60Gi ram, 30g heap, 4619Gi storage
- i3.2xlarge: 7 cpu, 55Gi ram, 27g heap, 1740Gi storage
- d2.2xlarge: 7 cpu, 55Gi ram, 27g heap, 11089Gi storage
I went with the i3en.2xlarge to get more storage per instance in my hot tier, and to get the "up to 25 Gbit/s" network speed instead of "up to 10". Note that this is the maximum burst speed; the baseline is much lower, and the instance gets throttled harshly if it exceeds the baseline for too long. This nerfed a few of my nodes during the initial backup to s3, and again when replacing a nodeSet and migrating data.
My current nodeSets for the hot and warm tiers (I generate them from a local helm chart through terraform, so I'm posting the output of kubectl get elasticsearch -o yaml rather than my original):
spec:
  nodeSets:
  - name: hot2
    config:
      node.attr.data: hot
      node.data: true
      node.ingest: true
      node.master: true
    count: 5
    podTemplate:
      spec:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: storage
                  operator: In
                  values:
                  - fast
        containers:
        - env:
          - name: ES_JAVA_OPTS
            value: -Xms30g -Xmx30g -Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:filecount=2,filesize=10m
          name: elasticsearch
          resources:
            limits:
              cpu: 7
              ephemeral-storage: 400Mi
              memory: 60Gi
            requests:
              cpu: 7
              ephemeral-storage: 400Mi
              memory: 60Gi
        initContainers:
        - command:
          - sh
          - -c
          - |
            bin/elasticsearch-plugin install --batch mapper-size repository-s3
          name: install-plugins
        priorityClassName: elasticsearch
        serviceAccountName: alerts-es
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 4619Gi
        storageClassName: local-storage
  - name: warm
    config:
      node.attr.data: warm
      node.data: true
      node.ingest: false
      node.master: false
    count: 5
    podTemplate:
      spec:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: storage
                  operator: In
                  values:
                  - high
        containers:
        - env:
          - name: ES_JAVA_OPTS
            value: -Xms27g -Xmx27g -Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:filecount=2,filesize=10m
          name: elasticsearch
          resources:
            limits:
              cpu: 7
              ephemeral-storage: 400Mi
              memory: 55Gi
            requests:
              cpu: 7
              ephemeral-storage: 400Mi
              memory: 55Gi
        initContainers:
        - command:
          - sh
          - -c
          - |
            bin/elasticsearch-plugin install --batch mapper-size repository-s3
          name: install-plugins
        priorityClassName: elasticsearch
        serviceAccountName: alerts-es
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 11089Gi
        storageClassName: local-storage
  secureSettings:
  - secretName: alerts-es-backup
  version: 7.7.0
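The priorityClassName: elasticsearch above refers to a PriorityClass created separately; a minimal sketch (the value of 1000000 is just an illustrative number, not necessarily what I used) looks like:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: elasticsearch
value: 1000000
globalDefault: false
description: Give Elasticsearch pods priority on the storage nodes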
Things worth noting:
- Added a priorityClass for elasticsearch (sketched above) so other pods don't compete for space on the special nodes
- The pod was getting killed for using over 100Mi of ephemeral storage, so I bumped the limit to 400Mi
- I'm using secureSettings for the s3 access key (see the sketch below) instead of allowing all pods to hit that s3 bucket
- I haven't added the dedicated masters yet because I haven't found any documentation around their storage requirements or recommended sizing
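For the secureSettings part, ECK turns each key of the referenced secret into an Elasticsearch keystore entry, so the alerts-es-backup secret can be created with the standard S3 client settings as keys. A sketch (the values are placeholders, and the namespace is whatever namespace the Elasticsearch resource lives in):

kubectl create secret generic alerts-es-backup \
  -n <elasticsearch-namespace> \
  --from-literal=s3.client.default.access_key=<access-key-id> \
  --from-literal=s3.client.default.secret_key=<secret-access-key>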