ECK Elasticsearch at 100% disk usage

Hi Everyone,

We've been having some issues with our Elasticsearch cluster hitting 100% disk usage on one or more nodes.
It is a three-node cluster with two master/data nodes and one voting-only node.
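For reference, the node roles can be double-checked with the cat nodes API:

GET _cat/nodes?v&h=name,node.role,master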

The nodes have a PV created by the Local Volume Provisioner storage class.

[root@k8s02 storage]# df -h /var/elasticsearch/storage/
Filesystem                     Size  Used Avail Use% Mounted on
/dev/mapper/almalinux-elastic 1019G  677G  342G  67% /var/elasticsearch/storage

[root@k8s02 storage]# kubectl get pv
NAME                CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                            STORAGECLASS    REASON   AGE
local-pv-7db3cb5    969Gi      RWO            Delete           Bound    default/elasticsearch-data-elastic-es-master-0   local-storage            170d
local-pv-9a836200   969Gi      RWO            Delete           Bound    default/elasticsearch-data-elastic-es-master-1   local-storage            170d

We run 10+ clusters in production. In every other one, hitting the flood-stage watermark sets the read_only_allow_delete block on all indices as expected; this cluster is the only one where that doesn't happen.
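For context, the block can be checked and cleared with the standard index settings API, roughly like this:

GET _all/_settings/index.blocks.read_only_allow_delete

PUT _all/_settings
{
  "index.blocks.read_only_allow_delete": null
}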

There are no notable log entries on any of the ELK stack components, nor in the OS logs of the Kubernetes nodes. This has happened to this cluster a few times already, forcing us to increase the disk space each time.

Does anyone have any clue on what the reason for that could be?

Here are all the cluster settings the node has configured:

[2025-01-03T09:36:57,957][INFO ][o.e.c.s.ClusterSettings  ] [elastic-es-master-1] updating [cluster.routing.allocation.node_concurrent_incoming_recoveries] from [2] to [4]
[2025-01-03T09:36:57,957][INFO ][o.e.c.s.ClusterSettings  ] [elastic-es-master-1] updating [cluster.routing.allocation.disk.watermark.high] from [90%] to [94%]
[2025-01-03T09:36:57,958][INFO ][o.e.c.s.ClusterSettings  ] [elastic-es-master-1] updating [cluster.routing.allocation.disk.watermark.flood_stage] from [95%] to [97%]
[2025-01-03T09:36:57,958][INFO ][o.e.c.s.ClusterSettings  ] [elastic-es-master-1] updating [cluster.routing.allocation.disk.watermark.high.max_headroom] from [150GB] to [-1]
[2025-01-03T09:36:57,958][INFO ][o.e.c.s.ClusterSettings  ] [elastic-es-master-1] updating [cluster.routing.allocation.disk.watermark.flood_stage.max_headroom] from [100GB] to [-1]
[2025-01-03T09:36:57,958][INFO ][o.e.c.s.ClusterSettings  ] [elastic-es-master-1] updating [cluster.max_shards_per_node] from [1000] to [3000]
[2025-01-03T09:36:57,959][INFO ][o.e.c.s.ClusterSettings  ] [elastic-es-master-1] updating [cluster.routing.allocation.disk.watermark.low] from [85%] to [92%]
[2025-01-03T09:36:57,959][INFO ][o.e.c.s.ClusterSettings  ] [elastic-es-master-1] updating [cluster.routing.allocation.disk.watermark.low.max_headroom] from [200GB] to [-1]
[2025-01-03T09:36:57,959][INFO ][o.e.c.s.ClusterSettings  ] [elastic-es-master-1] updating [cluster.routing.allocation.disk.watermark.high] from [90%] to [94%]
[2025-01-03T09:36:57,959][INFO ][o.e.c.s.ClusterSettings  ] [elastic-es-master-1] updating [cluster.routing.allocation.disk.watermark.high.max_headroom] from [150GB] to [-1]
[2025-01-03T09:36:57,959][INFO ][o.e.c.s.ClusterSettings  ] [elastic-es-master-1] updating [cluster.routing.allocation.disk.watermark.flood_stage] from [95%] to [97%]
[2025-01-03T09:36:57,959][INFO ][o.e.c.s.ClusterSettings  ] [elastic-es-master-1] updating [cluster.routing.allocation.disk.watermark.flood_stage.max_headroom] from [100GB] to [-1]
[2025-01-03T09:36:57,960][INFO ][o.e.c.s.ClusterSettings  ] [elastic-es-master-1] updating [cluster.routing.allocation.disk.watermark.low] from [85%] to [92%]
[2025-01-03T09:36:57,960][INFO ][o.e.c.s.ClusterSettings  ] [elastic-es-master-1] updating [cluster.routing.allocation.disk.watermark.low.max_headroom] from [200GB] to [-1]
[2025-01-03T09:36:57,960][INFO ][o.e.c.s.ClusterSettings  ] [elastic-es-master-1] updating [cluster.routing.allocation.disk.watermark.high] from [90%] to [94%]
[2025-01-03T09:36:57,960][INFO ][o.e.c.s.ClusterSettings  ] [elastic-es-master-1] updating [cluster.routing.allocation.disk.watermark.high.max_headroom] from [150GB] to [-1]
[2025-01-03T09:36:57,960][INFO ][o.e.c.s.ClusterSettings  ] [elastic-es-master-1] updating [cluster.routing.allocation.disk.watermark.flood_stage] from [95%] to [97%]
[2025-01-03T09:36:57,960][INFO ][o.e.c.s.ClusterSettings  ] [elastic-es-master-1] updating [cluster.routing.allocation.disk.watermark.flood_stage.max_headroom] from [100GB] to [-1]
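For reference, the same values expressed through the cluster settings API look roughly like this (whether they live under persistent or transient settings isn't shown in the log):

GET _cluster/settings?flat_settings=true&filter_path=*.cluster.routing.allocation.disk*

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low": "92%",
    "cluster.routing.allocation.disk.watermark.low.max_headroom": "-1",
    "cluster.routing.allocation.disk.watermark.high": "94%",
    "cluster.routing.allocation.disk.watermark.high.max_headroom": "-1",
    "cluster.routing.allocation.disk.watermark.flood_stage": "97%",
    "cluster.routing.allocation.disk.watermark.flood_stage.max_headroom": "-1",
    "cluster.routing.allocation.node_concurrent_incoming_recoveries": 4,
    "cluster.max_shards_per_node": 3000
  }
}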

Any help is greatly appreciated!

Cheers,
Luka

Can you share the output of the following API call?

GET _cat/allocation?v

@lduvnjak any update here?

If you resolved your issue (and I hope you did), it always helps the community if you share how you did it!

Sorry for the late reply; for some reason I didn't get an email notification that someone had replied to the thread.

Here's the output:

shards shards.undesired write_load.forecast disk.indices.forecast disk.indices disk.used disk.avail disk.total disk.percent host          ip            node                node.role
   281                0                 0.0               794.6gb      764.6gb   772.7gb    245.2gb     1018gb           75 10.233.73.54  10.233.73.54  elastic-es-master-0 cdfhilmrstw
   281                0                 0.0               794.6gb      761.6gb   768.9gb      249gb     1018gb           75 10.233.112.62 10.233.112.62 elastic-es-master-1 cdfhilmrstw

The cluster is working fine at the moment since we forcefully restarted the stuck pod.
We also manually filled up the disk with fallocate, and Elasticsearch detected it just fine, triggering the low, high, and flood-stage watermarks accordingly. So we're not really sure when or why this happens.
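The fill test was essentially along these lines (the size here is just illustrative):

# fill the data path until the watermarks trip
fallocate -l 300G /var/elasticsearch/storage/fill-test

# watch the stages get applied in the node logs and in _cat/allocation,
# then clean up
rm /var/elasticsearch/storage/fill-test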

All looks good then. Consider using an alerting system like AutoOps to make sure your cluster stays healthy and the disks don't fill up.
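Until proper alerting is in place, even a simple scripted check against the cat allocation API can act as a stopgap. Something like the following (the endpoint, credentials, and threshold are placeholders; adjust them for your ECK setup):

# warn when any node crosses 90% disk usage
curl -sk -u "elastic:$ELASTIC_PASSWORD" \
  "https://elastic-es-http:9200/_cat/allocation?h=node,disk.percent" \
  | awk '$2+0 > 90 { print "WARN: " $1 " at " $2 "% disk used" }'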