Service crashed | Disk Utilization was 100%

Hi Team,

We have a 14-node Elasticsearch (v8.17) cluster (3 master + 6 hot + 5 warm) running on-prem. Today the cluster went into a red state, and I found that one of the warm nodes had stopped because its disk was 100% full.
Each warm node has a 15 TB disk. Day to day, disk utilization on the warm nodes is 97 to 98%, and we retain 105 days of data. Today I changed the retention policy from 105 to 90 days, but one warm node would not start, so I manually deleted a directory from path.data, although I know that is not advisable.

Is there any setting that keeps the disk from ever going above 97%, so that the Elasticsearch service itself never stops, even if that means the indices go red? I know Elasticsearch has disk watermark settings that can be set at the cluster level. We are using the defaults, but we do not want the service to stop.

I thought cluster.routing.allocation.disk.watermark.flood_stage defaulted to 95%? Did you change this? Is anything other than Elasticsearch writing to the filesystem on the disk partition that path.data points to?

Well, I don't want to judge, you were in a difficult spot, but ... that was IMHO likely not the best idea. Sometimes you need to take the pain and fix the root cause, which seems to be how close to the edge you are living on a normal day. Just consider what would happen if one of your warm nodes actually died (hardware issue, sysadmin error, ...).

Did you make any changes to the watermark settings? You said that you are using the default settings, but with the default watermarks the warm nodes should not be able to sit at 97% usage, at least not from data written by the Elasticsearch service.

The default watermark settings are 85% for the low stage, 90% for the high stage and 95% for the flood stage, so you should have started having issues when usage reached 95%.
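
To make the thresholds concrete, here is a minimal Python sketch that maps a disk usage percentage to the watermark stage it would trip. The function name `watermark_stage` is made up for illustration, and the percentages are the documented defaults (low 85%, high 90%, flood stage 95%). As far as I know, recent 8.x releases also have `max_headroom` settings that can shift the effective thresholds on very large disks; this sketch ignores them.

```python
def watermark_stage(used_pct, low=85.0, high=90.0, flood=95.0):
    """Return the highest disk watermark stage reached at used_pct.

    Hypothetical helper for illustration only. The thresholds are
    percentages of used disk space, matching the default settings.
    """
    if used_pct >= flood:
        # Flood stage: indices with a shard on this node get a
        # read-only-allow-delete block, so writes are rejected.
        return "flood_stage"
    if used_pct >= high:
        # High: Elasticsearch tries to relocate shards off this node.
        return "high"
    if used_pct >= low:
        # Low: no new shards are allocated to this node.
        return "low"
    return "ok"


for pct in (50, 87, 92, 97):
    print(pct, watermark_stage(pct))
```

Note that none of these stages stops the node. At 97% the indices would already be write-blocked, which is why usage that high is surprising with default settings.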

The watermarks are meant for this, but you need to monitor your cluster and act as soon as you start getting warnings about watermark levels being reached.
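
If you do want to pin the watermarks explicitly instead of relying on the defaults, the rough shape of the request body is sketched below. It only builds the JSON for a `PUT _cluster/settings` call; the helper name `watermark_settings` is invented for this example, and the values shown are just the defaults, not a recommendation for your cluster.

```python
import json


def watermark_settings(low="85%", high="90%", flood="95%"):
    """Build a body for PUT _cluster/settings (hypothetical helper).

    The three setting names are the real cluster-level watermark
    settings; the values are assumptions you should adjust.
    """
    return {
        "persistent": {
            "cluster.routing.allocation.disk.watermark.low": low,
            "cluster.routing.allocation.disk.watermark.high": high,
            "cluster.routing.allocation.disk.watermark.flood_stage": flood,
        }
    }


# Print the body so it can be pasted into Dev Tools or sent with curl.
print(json.dumps(watermark_settings(), indent=2))
```

Keep in mind that watermarks only control shard allocation and write blocks. They do not free space by themselves, so retention still has to fit the disks.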

You also need to monitor the underlying infrastructure and act earlier; there is not much else to do.
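
As a very rough idea of what such an infrastructure check could look like, here is a small Python sketch using only the standard library. The function names and the 85%/90% thresholds are assumptions for illustration; in practice you would probably feed this into whatever monitoring system you already run.

```python
import shutil


def used_percent(path):
    """Return the used space of the filesystem holding path, in percent."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def alert_level(pct, warn_pct=85.0, crit_pct=90.0):
    """Map a usage percentage to an alert level (thresholds are assumptions)."""
    if pct >= crit_pct:
        return "critical"
    if pct >= warn_pct:
        return "warning"
    return "ok"


def check_disk(path, warn_pct=85.0, crit_pct=90.0):
    """Return (usage percent, alert level) for the filesystem holding path."""
    pct = used_percent(path)
    return pct, alert_level(pct, warn_pct, crit_pct)
```

You would call `check_disk` with the directory that path.data points to on each node, for example from cron, and alert well before the flood stage is anywhere near.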