Hi all,
We have an ECK-managed Elasticsearch cluster v8.13.2 running on Kubernetes v1.29.2.
The cluster has dedicated hot and warm data nodes. We have completed load testing and are now at the stage where we want to test recovery from failure. After pulling the cord on one of the Kubernetes nodes, 3 pods became unavailable:
- logstash
- elasticsearch-master
- elasticsearch-data-hot
After starting the node again, Logstash recovered, but both Elasticsearch pods went into CrashLoopBackOff. From what we can tell, the pods are not restarted, and the container stuck in the crash loop on both of them is elastic-internal-init-filesystem. These are the logs we get from the failing container:
Starting init script
Copying /usr/share/elasticsearch/config/* to /mnt/elastic-internal/elasticsearch-config-local/
'/usr/share/elasticsearch/config/elasticsearch-plugins.example.yml' -> '/mnt/elastic-internal/elasticsearch-config-local/elasticsearch-plugins.example.yml'
'/usr/share/elasticsearch/config/elasticsearch.yml' -> '/mnt/elastic-internal/elasticsearch-config-local/elasticsearch.yml'
'/usr/share/elasticsearch/config/http-certs/..2024_05_27_10_44_08.1291635209/ca.crt' -> '/mnt/elastic-internal/elasticsearch-config-local/http-certs/..2024_05_27_10_44_08.1291635209/ca.crt'
'/usr/share/elasticsearch/config/http-certs/..2024_05_27_10_44_08.1291635209/tls.crt' -> '/mnt/elastic-internal/elasticsearch-config-local/http-certs/..2024_05_27_10_44_08.1291635209/tls.crt'
'/usr/share/elasticsearch/config/http-certs/..2024_05_27_10_44_08.1291635209/tls.key' -> '/mnt/elastic-internal/elasticsearch-config-local/http-certs/..2024_05_27_10_44_08.1291635209/tls.key'
removed '/mnt/elastic-internal/elasticsearch-config-local/http-certs/..data'
'/usr/share/elasticsearch/config/http-certs/..data' -> '/mnt/elastic-internal/elasticsearch-config-local/http-certs/..data'
removed '/mnt/elastic-internal/elasticsearch-config-local/http-certs/ca.crt'
'/usr/share/elasticsearch/config/http-certs/ca.crt' -> '/mnt/elastic-internal/elasticsearch-config-local/http-certs/ca.crt'
removed '/mnt/elastic-internal/elasticsearch-config-local/http-certs/tls.crt'
'/usr/share/elasticsearch/config/http-certs/tls.crt' -> '/mnt/elastic-internal/elasticsearch-config-local/http-certs/tls.crt'
removed '/mnt/elastic-internal/elasticsearch-config-local/http-certs/tls.key'
'/usr/share/elasticsearch/config/http-certs/tls.key' -> '/mnt/elastic-internal/elasticsearch-config-local/http-certs/tls.key'
'/usr/share/elasticsearch/config/jvm.options' -> '/mnt/elastic-internal/elasticsearch-config-local/jvm.options'
'/usr/share/elasticsearch/config/log4j2.file.properties' -> '/mnt/elastic-internal/elasticsearch-config-local/log4j2.file.properties'
'/usr/share/elasticsearch/config/log4j2.properties' -> '/mnt/elastic-internal/elasticsearch-config-local/log4j2.properties'
cp: preserving times for '/mnt/elastic-internal/elasticsearch-config-local/log4j2.properties': Operation not permitted
'/usr/share/elasticsearch/config/operator/..2024_05_27_10_44_08.3060476595/settings.json' -> '/mnt/elastic-internal/elasticsearch-config-local/operator/..2024_05_27_10_44_08.3060476595/settings.json'
removed '/mnt/elastic-internal/elasticsearch-config-local/operator/..data'
'/usr/share/elasticsearch/config/operator/..data' -> '/mnt/elastic-internal/elasticsearch-config-local/operator/..data'
removed '/mnt/elastic-internal/elasticsearch-config-local/operator/settings.json'
'/usr/share/elasticsearch/config/operator/settings.json' -> '/mnt/elastic-internal/elasticsearch-config-local/operator/settings.json'
'/usr/share/elasticsearch/config/role_mapping.yml' -> '/mnt/elastic-internal/elasticsearch-config-local/role_mapping.yml'
'/usr/share/elasticsearch/config/roles.yml' -> '/mnt/elastic-internal/elasticsearch-config-local/roles.yml'
'/usr/share/elasticsearch/config/transport-remote-certs/..2024_05_27_10_44_08.2100192238/ca.crt' -> '/mnt/elastic-internal/elasticsearch-config-local/transport-remote-certs/..2024_05_27_10_44_08.2100192238/ca.crt'
removed '/mnt/elastic-internal/elasticsearch-config-local/transport-remote-certs/..data'
'/usr/share/elasticsearch/config/transport-remote-certs/..data' -> '/mnt/elastic-internal/elasticsearch-config-local/transport-remote-certs/..data'
removed '/mnt/elastic-internal/elasticsearch-config-local/transport-remote-certs/ca.crt'
'/usr/share/elasticsearch/config/transport-remote-certs/ca.crt' -> '/mnt/elastic-internal/elasticsearch-config-local/transport-remote-certs/ca.crt'
'/usr/share/elasticsearch/config/users' -> '/mnt/elastic-internal/elasticsearch-config-local/users'
'/usr/share/elasticsearch/config/users_roles' -> '/mnt/elastic-internal/elasticsearch-config-local/users_roles'
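For reference, this is roughly how we pull those logs (the pod name below is only an example following ECK's <cluster>-es-<nodeSet>-<ordinal> naming, not copied verbatim from our cluster):

# Pod status after the Kubernetes node came back
kubectl -n elk-system get pods

# Init container state and restart count on one of the affected pods (example pod name)
kubectl -n elk-system describe pod elasticsearch-es-data-hot-0

# Logs of the failing init container (the output pasted above)
kubectl -n elk-system logs elasticsearch-es-data-hot-0 -c elastic-internal-init-filesystem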
If we delete the pods manually, they come back up properly and the Elasticsearch cluster returns to a healthy state.
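Concretely, the manual workaround we apply at the moment looks roughly like this (again, pod names are only examples):

# Delete the stuck pods; the StatefulSet recreates them and they start cleanly
kubectl -n elk-system delete pod elasticsearch-es-cluster-0 elasticsearch-es-data-hot-0

# Watch the health reported by ECK until the cluster recovers
kubectl -n elk-system get elasticsearch -w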
Has anyone experienced something similar? Are we doing something wrong?
The following is our Elasticsearch PoC config:
---
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: elasticsearch
  namespace: elk-system
  labels:
    app: elasticsearch
    environment: dev
spec:
  version: 8.13.2
  volumeClaimDeletePolicy: DeleteOnScaledownOnly
  auth:
    roles:
      - secretName: logstash-kafka-role
  monitoring:
    metrics:
      elasticsearchRefs:
        - name: elasticsearch
    logs:
      elasticsearchRefs:
        - name: elasticsearch
  nodeSets:
    - name: cluster
      count: 3
      config:
        cluster.routing.allocation.disk.watermark.low: "98%"
        cluster.routing.allocation.disk.watermark.high: "99%"
        cluster.routing.allocation.disk.watermark.flood_stage: "99%"
        node.roles: ["master", "ingest"]
        xpack.ml.enabled: false
        ingest.geoip.downloader.enabled: false
      podTemplate:
        metadata:
          labels:
            app: elasticsearch
            environment: dev
        spec:
          topologySpreadConstraints:
            - maxSkew: 1
              topologyKey: kubernetes.io/zone
              whenUnsatisfiable: DoNotSchedule
              labelSelector:
                matchLabels:
                  app: elasticsearch
                  environment: dev
          affinity:
            nodeAffinity:
              preferredDuringSchedulingIgnoredDuringExecution:
                - weight: 1
                  preference:
                    matchExpressions:
                      - key: node-role.kubernetes.io/c8m32
                        operator: Exists
                        values: []
          tolerations:
            - key: c8m32
              effect: NoSchedule
              operator: Exists
          containers:
            - name: elasticsearch
              env:
                - name: ES_JAVA_OPTS
                  value: -Xms3g -Xmx3g
              resources:
                requests:
                  memory: 4500Mi
                  cpu: 1
                limits:
                  memory: 4500Mi
                  cpu: 1
              securityContext:
                runAsUser: 2000
                runAsGroup: 3000
    - name: data-hot
      count: 2
      config:
        cluster.routing.allocation.disk.watermark.low: "98%"
        cluster.routing.allocation.disk.watermark.high: "99%"
        cluster.routing.allocation.disk.watermark.flood_stage: "99%"
        node.roles: ["data_hot", "data_content"]
        xpack.ml.enabled: false
        ingest.geoip.downloader.enabled: false
      podTemplate:
        metadata:
          labels:
            app: elasticsearch
            environment: dev
        spec:
          topologySpreadConstraints:
            - maxSkew: 1
              topologyKey: kubernetes.io/zone
              whenUnsatisfiable: DoNotSchedule
              labelSelector:
                matchLabels:
                  app: elasticsearch
                  environment: dev
          affinity:
            nodeAffinity:
              preferredDuringSchedulingIgnoredDuringExecution:
                - weight: 1
                  preference:
                    matchExpressions:
                      - key: node-role.kubernetes.io/c8m32
                        operator: Exists
                        values: []
          tolerations:
            - key: c8m32
              effect: NoSchedule
              operator: Exists
          containers:
            - name: elasticsearch
              env:
                - name: ES_JAVA_OPTS
                  value: -Xms12g -Xmx12g
              resources:
                requests:
                  memory: 16Gi
                  cpu: 8
                limits:
                  memory: 16Gi
                  cpu: 8
              securityContext:
                runAsUser: 2000
                runAsGroup: 3000
      volumeClaimTemplates:
        - apiVersion: v1
          kind: PersistentVolumeClaim
          metadata:
            name: elasticsearch-data
            labels:
              app: elasticsearch
              environment: dev
          spec:
            storageClassName: sc-ebs-gp3-xfs
            accessModes:
              - ReadWriteOnce
            resources:
              requests:
                storage: 300Gi
    - name: data-warm
      count: 2
      config:
        cluster.routing.allocation.disk.watermark.low: "98%"
        cluster.routing.allocation.disk.watermark.high: "99%"
        cluster.routing.allocation.disk.watermark.flood_stage: "99%"
        node.roles: ["data_warm"]
        xpack.ml.enabled: false
        ingest.geoip.downloader.enabled: false
      podTemplate:
        metadata:
          labels:
            app: elasticsearch
            environment: dev
        spec:
          topologySpreadConstraints:
            - maxSkew: 1
              topologyKey: kubernetes.io/zone
              whenUnsatisfiable: DoNotSchedule
              labelSelector:
                matchLabels:
                  app: elasticsearch
                  environment: dev
          containers:
            - name: elasticsearch
              env:
                - name: ES_JAVA_OPTS
                  value: -Xms5g -Xmx5g
              resources:
                requests:
                  memory: 6Gi
                  cpu: 2
                limits:
                  memory: 6Gi
                  cpu: 2
              securityContext:
                runAsUser: 2000
                runAsGroup: 3000
      volumeClaimTemplates:
        - apiVersion: v1
          kind: PersistentVolumeClaim
          metadata:
            name: elasticsearch-data
            labels:
              app: elasticsearch
              environment: dev
          spec:
            storageClassName: sc-ebs-gp3-xfs
            accessModes:
              - ReadWriteOnce
            resources:
              requests:
                storage: 600Gi
Thank you for any help, advice, or docs you can give me.