My ECK topology is 3 master nodes and 4 data/ingest nodes (as specified in the `config:` stanza of each nodeSet).
The data nodes all have large volumes attached.
Without knowing better, I left the masters at the default, which happens to be a 1Gi volume.
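For context, here is a minimal sketch of what my spec looks like (the cluster name, version, and sizes are illustrative, not my real values; the point is that the masters have no volumeClaimTemplates, so they fall back to the default):

```yaml
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: mycluster        # illustrative name
spec:
  version: 7.10.0        # illustrative version
  nodeSets:
    - name: masters
      count: 3
      config:
        node.master: true
        node.data: false
        node.ingest: false
      # no volumeClaimTemplates here, so the default 1Gi volume is used
    - name: data
      count: 4
      config:
        node.master: false
        node.data: true
        node.ingest: true
      volumeClaimTemplates:
        - metadata:
            name: elasticsearch-data
          spec:
            accessModes: ["ReadWriteOnce"]
            resources:
              requests:
                storage: 1Ti   # illustrative: the "large volumes"
```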
I found that two of the master nodes are now crash-looping due to disk capacity, as is evident in the pod logs:
"org.elasticsearch.bootstrap.StartupException: ElasticsearchException[failed to bind service]; nested: WriteStateException[failed to write state to the first location tmp file /usr/share/elasticsearch/data/nodes/0/node-2.st.tmp]; nested: IOException[No space left on device];",
Now I'm left with no quorum and I'm not sure how to resolve this. I've tried to follow the process for increasing volume size, which I've done successfully for data nodes in the past. The process I followed is to (see the sketch after this list):
- rename the nodeSet whose config has `node.master: true`
- add a volumeClaimTemplates entry with the larger size
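Roughly, the renamed nodeSet looked like this (the name and the size are illustrative):

```yaml
nodeSets:
  - name: masters-v2     # renamed from "masters" to force new pods/PVCs
    count: 3
    config:
      node.master: true
      node.data: false
      node.ingest: false
    volumeClaimTemplates:
      - metadata:
          name: elasticsearch-data
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 10Gi   # illustrative larger size
```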
What I've found, however, is that no new pods are being created.
I do have a new StatefulSet matching the new name, but it is set to 0 replicas (it should be 3).
Is there any path to recovery out of this that will let me keep using the data nodes? I was thinking of something like binding a container to the volume to try to eke out enough clean-up on the master(s) to get them going again, but the volumes are bound to the crash-looping pods in RWO mode.
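Something like this throwaway pod is what I had in mind (the PVC name is illustrative, following the usual ECK naming pattern; as noted, with ReadWriteOnce the crash-looping pod would presumably have to be gone before this pod could attach the volume):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: master-disk-cleanup
spec:
  containers:
    - name: shell
      image: busybox
      command: ["sleep", "3600"]   # idle so I can kubectl exec in and clean up
      volumeMounts:
        - name: es-data
          mountPath: /usr/share/elasticsearch/data
  volumes:
    - name: es-data
      persistentVolumeClaim:
        claimName: elasticsearch-data-mycluster-es-masters-1   # illustrative PVC name
  restartPolicy: Never
```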
Any suggestions appreciated. Thanks!