Nodes are bootlooping when attempting to start cluster

This issue cropped up over the weekend. I'm running a 4-instance ECE setup, and when I try to spin up a cluster on 3 of these instances, I get the following error:

Unexpected error during step: [forced-cluster-reboot]: [no.found.constructor.steps.waiting.ServerBootloopingException: Instance is bootlooping [ElasticsearchInstance(ElasticsearchCluster(b1f1c59858344ab0a38f223013a578ee),instance-0000000006)]]

I'm not sure what caused this, since it worked just 2 days ago, but it's preventing me from doing anything at all with the cluster.

Hi Brian,
There can be many reasons, so I would suggest checking the cluster's logs to find the root cause (you can find them in the logging-and-metrics cluster). Usually, though, it happens for one of the following reasons:

  • insufficient memory allocated to a node.
  • insufficient disk quota.

To resolve either of these issues, this API call can be useful:

curl -u root -X PUT \
  "https://$COORDINATOR_HOST:12443/api/v1/clusters/elasticsearch/$CLUSTER_ID/instances/$COMMA_SEPARATED_BOOTLOOPING_INSTANCES_IDS/settings?restart_after_update=true" \
  -H 'Content-Type: application/json' \
  -d '{
  "instance_capacity": 8192
}'

The command above overrides the memory quota for the given instance (or instances), but it does not change the cluster plan. This means the override is lost the next time you apply a plan to the cluster.
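
The instance list in the URL is a plain comma-separated string. As a minimal sketch of assembling it in the shell: join_ids is a helper name made up here, and only instance-0000000006 comes from the error message above; the second instance name is a hypothetical placeholder.

```shell
# Join all arguments with commas, e.g. "a b c" -> "a,b,c".
# "$*" joins on the first character of IFS; the subshell keeps the IFS change local.
join_ids() {
  (IFS=,; printf '%s\n' "$*")
}

# instance-0000000006 is taken from the error message in this thread;
# instance-0000000007 is a hypothetical second instance, for illustration only.
COMMA_SEPARATED_BOOTLOOPING_INSTANCES_IDS=$(join_ids instance-0000000006 instance-0000000007)
echo "$COMMA_SEPARATED_BOOTLOOPING_INSTANCES_IDS"
```

The resulting variable can then be used in the curl call as written.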

After the cluster starts and syncs, I recommend increasing the cluster's memory quota by changing its capacity in the UI.

As it turns out, the issue was that every ECE instance had run out of storage on the root filesystem, despite hundreds of gigabytes remaining in /mnt/data. That caused the error I mentioned in this thread, along with many others, and the errors persisted even after I expanded the root storage. I ended up having to reinstall ECE because a cluster would not change its configuration, start up, or be deleted.
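
For anyone who lands here with the same symptom, a quick disk check on each host might have caught this earlier. The following is only a sketch: the function names and the 90% threshold are assumptions chosen for illustration, not anything ECE provides, and the two paths are the ones mentioned in this thread.

```shell
# Print the used-space percentage (integer, no "%") of the filesystem holding $1.
# Uses POSIX "df -P" so each filesystem is reported on a single line.
fs_use_percent() {
  df -P "$1" | awk 'NR==2 { gsub("%", "", $5); print $5 }'
}

# Warn and return 1 when the filesystem holding $1 is at or above $2 percent full.
check_fs() {
  used=$(fs_use_percent "$1")
  if [ -z "$used" ]; then
    echo "SKIP: could not read usage for $1"
    return 0
  fi
  if [ "$used" -ge "$2" ]; then
    echo "WARNING: $1 is ${used}% full"
    return 1
  fi
  echo "OK: $1 is ${used}% full"
}

# Check the root filesystem and the data mount; 90 is an arbitrary threshold.
for path in / /mnt/data; do
  if [ -d "$path" ]; then
    check_fs "$path" 90 || true
  else
    echo "SKIP: $path not present on this host"
  fi
done
```

Running this on every ECE host covers the root filesystem separately from /mnt/data, which is the distinction that mattered here.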

