My ECK topology is 3 master nodes and 4 data/ingest nodes (as specified in the `config:` stanza of each nodeSet).
The data nodes all have large volumes attached.
Without knowing better, I left the masters at the default, which happens to be a 1Gi volume.
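For context, here is a minimal sketch of what my spec looks like (the cluster name, version, and sizes are illustrative, not my real values; the point is that the masters have no volumeClaimTemplates, so they fall back to the default):

```yaml
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: mycluster        # illustrative name
spec:
  version: 7.10.0        # illustrative version
  nodeSets:
    - name: masters
      count: 3
      config:
        node.master: true
        node.data: false
        node.ingest: false
      # no volumeClaimTemplates here, so the default 1Gi volume is used
    - name: data
      count: 4
      config:
        node.master: false
        node.data: true
        node.ingest: true
      volumeClaimTemplates:
        - metadata:
            name: elasticsearch-data
          spec:
            accessModes: ["ReadWriteOnce"]
            resources:
              requests:
                storage: 1Ti   # illustrative: the "large volumes"
```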
I found that two of the master nodes are now crash-looping due to disk capacity, as is evident in the pod logs:
"org.elasticsearch.bootstrap.StartupException: ElasticsearchException[failed to bind service]; nested: WriteStateException[failed to write state to the first location tmp file /usr/share/elasticsearch/data/nodes/0/node-2.st.tmp]; nested: IOException[No space left on device];",
Now I'm left with no quorum and I'm not sure how to resolve this. I've tried to follow the process for increasing volume size, which I've done successfully for data nodes in the past. The process I followed is to (see the sketch after this list):
- rename the nodeSet whose config has `node.master: true`
- add a volumeClaimTemplates entry with the larger size
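Roughly, the renamed nodeSet looked like this (the name and the size are illustrative):

```yaml
nodeSets:
  - name: masters-v2     # renamed from "masters" to force new pods/PVCs
    count: 3
    config:
      node.master: true
      node.data: false
      node.ingest: false
    volumeClaimTemplates:
      - metadata:
          name: elasticsearch-data
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 10Gi   # illustrative larger size
```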
What I've found, however, is that no new pods are being created.
I do have a new StatefulSet matching the new name, but it is set to 0 replicas (it should be 3).
Is there any path to recovery out of this that will let me keep using the data nodes? I was thinking of something like binding a container to the volume to try to eke out enough clean-up on the master(s) to get them going again, but the volumes are bound to the crash-looping pods in RWO mode.
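Something like this throwaway pod is what I had in mind (the PVC name is illustrative, following the usual ECK naming pattern; as noted, with ReadWriteOnce the crash-looping pod would presumably have to be gone before this pod could attach the volume):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: master-disk-cleanup
spec:
  containers:
    - name: shell
      image: busybox
      command: ["sleep", "3600"]   # idle so I can kubectl exec in and clean up
      volumeMounts:
        - name: es-data
          mountPath: /usr/share/elasticsearch/data
  volumes:
    - name: es-data
      persistentVolumeClaim:
        claimName: elasticsearch-data-mycluster-es-masters-1   # illustrative PVC name
  restartPolicy: Never
```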
Any suggestions appreciated. Thanks!