ECK Elasticsearch at 100% disk usage

Hi Everyone,

We've been having some issues with our Elasticsearch cluster hitting 100% disk usage on one or more nodes.
It is a three-node cluster with two master/data nodes and one voting-only node.
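For reference, the node roles can be double-checked with the cat nodes API:

GET _cat/nodes?v&h=name,node.role,master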

The nodes have a PV created by the Local Volume Provisioner storage class.

[root@k8s02 storage]# df -h /var/elasticsearch/storage/
Filesystem                     Size  Used Avail Use% Mounted on
/dev/mapper/almalinux-elastic 1019G  677G  342G  67% /var/elasticsearch/storage

[root@k8s02 storage]# kubectl get pv
NAME                CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                            STORAGECLASS    REASON   AGE
local-pv-7db3cb5    969Gi      RWO            Delete           Bound    default/elasticsearch-data-elastic-es-master-0   local-storage            170d
local-pv-9a836200   969Gi      RWO            Delete           Bound    default/elasticsearch-data-elastic-es-master-1   local-storage            170d

We run 10+ clusters in production. In every other one, hitting the flood-stage watermark sets the read_only_allow_delete block on all indices as expected; this cluster is the only one where that doesn't happen.
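For context, the block can be checked and cleared with the standard index settings API, roughly like this:

GET _all/_settings/index.blocks.read_only_allow_delete

PUT _all/_settings
{
  "index.blocks.read_only_allow_delete": null
}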

There are no notable log entries on any of the ELK stack components, nor in the OS logs of the Kubernetes nodes. This has happened to this cluster a few times already, forcing us to increase the disk space each time.

Does anyone have any clue on what the reason for that could be?

Here are all the cluster settings the node has configured:

[2025-01-03T09:36:57,957][INFO ][o.e.c.s.ClusterSettings  ] [elastic-es-master-1] updating [cluster.routing.allocation.node_concurrent_incoming_recoveries] from [2] to [4]
[2025-01-03T09:36:57,957][INFO ][o.e.c.s.ClusterSettings  ] [elastic-es-master-1] updating [cluster.routing.allocation.disk.watermark.high] from [90%] to [94%]
[2025-01-03T09:36:57,958][INFO ][o.e.c.s.ClusterSettings  ] [elastic-es-master-1] updating [cluster.routing.allocation.disk.watermark.flood_stage] from [95%] to [97%]
[2025-01-03T09:36:57,958][INFO ][o.e.c.s.ClusterSettings  ] [elastic-es-master-1] updating [cluster.routing.allocation.disk.watermark.high.max_headroom] from [150GB] to [-1]
[2025-01-03T09:36:57,958][INFO ][o.e.c.s.ClusterSettings  ] [elastic-es-master-1] updating [cluster.routing.allocation.disk.watermark.flood_stage.max_headroom] from [100GB] to [-1]
[2025-01-03T09:36:57,958][INFO ][o.e.c.s.ClusterSettings  ] [elastic-es-master-1] updating [cluster.max_shards_per_node] from [1000] to [3000]
[2025-01-03T09:36:57,959][INFO ][o.e.c.s.ClusterSettings  ] [elastic-es-master-1] updating [cluster.routing.allocation.disk.watermark.low] from [85%] to [92%]
[2025-01-03T09:36:57,959][INFO ][o.e.c.s.ClusterSettings  ] [elastic-es-master-1] updating [cluster.routing.allocation.disk.watermark.low.max_headroom] from [200GB] to [-1]
[2025-01-03T09:36:57,959][INFO ][o.e.c.s.ClusterSettings  ] [elastic-es-master-1] updating [cluster.routing.allocation.disk.watermark.high] from [90%] to [94%]
[2025-01-03T09:36:57,959][INFO ][o.e.c.s.ClusterSettings  ] [elastic-es-master-1] updating [cluster.routing.allocation.disk.watermark.high.max_headroom] from [150GB] to [-1]
[2025-01-03T09:36:57,959][INFO ][o.e.c.s.ClusterSettings  ] [elastic-es-master-1] updating [cluster.routing.allocation.disk.watermark.flood_stage] from [95%] to [97%]
[2025-01-03T09:36:57,959][INFO ][o.e.c.s.ClusterSettings  ] [elastic-es-master-1] updating [cluster.routing.allocation.disk.watermark.flood_stage.max_headroom] from [100GB] to [-1]
[2025-01-03T09:36:57,960][INFO ][o.e.c.s.ClusterSettings  ] [elastic-es-master-1] updating [cluster.routing.allocation.disk.watermark.low] from [85%] to [92%]
[2025-01-03T09:36:57,960][INFO ][o.e.c.s.ClusterSettings  ] [elastic-es-master-1] updating [cluster.routing.allocation.disk.watermark.low.max_headroom] from [200GB] to [-1]
[2025-01-03T09:36:57,960][INFO ][o.e.c.s.ClusterSettings  ] [elastic-es-master-1] updating [cluster.routing.allocation.disk.watermark.high] from [90%] to [94%]
[2025-01-03T09:36:57,960][INFO ][o.e.c.s.ClusterSettings  ] [elastic-es-master-1] updating [cluster.routing.allocation.disk.watermark.high.max_headroom] from [150GB] to [-1]
[2025-01-03T09:36:57,960][INFO ][o.e.c.s.ClusterSettings  ] [elastic-es-master-1] updating [cluster.routing.allocation.disk.watermark.flood_stage] from [95%] to [97%]
[2025-01-03T09:36:57,960][INFO ][o.e.c.s.ClusterSettings  ] [elastic-es-master-1] updating [cluster.routing.allocation.disk.watermark.flood_stage.max_headroom] from [100GB] to [-1]
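For reference, the same values expressed through the cluster settings API look roughly like this (whether they live under persistent or transient settings isn't shown in the log):

GET _cluster/settings?flat_settings=true&filter_path=*.cluster.routing.allocation.disk*

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low": "92%",
    "cluster.routing.allocation.disk.watermark.low.max_headroom": "-1",
    "cluster.routing.allocation.disk.watermark.high": "94%",
    "cluster.routing.allocation.disk.watermark.high.max_headroom": "-1",
    "cluster.routing.allocation.disk.watermark.flood_stage": "97%",
    "cluster.routing.allocation.disk.watermark.flood_stage.max_headroom": "-1",
    "cluster.routing.allocation.node_concurrent_incoming_recoveries": 4,
    "cluster.max_shards_per_node": 3000
  }
}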

Any help is greatly appreciated!

Cheers,
Luka

Can you share the output of the following API call?

GET _cat/allocation?v

@lduvnjak any update here?

If you resolved your issue (and I hope you did), it always helps the community if you share how you did it!

Sorry for the late reply; for some reason I didn't get an email notification that someone had replied to the thread.

Here's the output:

shards shards.undesired write_load.forecast disk.indices.forecast disk.indices disk.used disk.avail disk.total disk.percent host          ip            node                node.role
   281                0                 0.0               794.6gb      764.6gb   772.7gb    245.2gb     1018gb           75 10.233.73.54  10.233.73.54  elastic-es-master-0 cdfhilmrstw
   281                0                 0.0               794.6gb      761.6gb   768.9gb      249gb     1018gb           75 10.233.112.62 10.233.112.62 elastic-es-master-1 cdfhilmrstw

The cluster is working fine at the moment since we forcefully restarted the stuck pod.
We also manually filled up the disk with fallocate, and Elasticsearch detected it just fine, triggering the low, high, and flood-stage watermarks accordingly. So we're not really sure when or why this happens.
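The fill test was essentially along these lines (the size here is just illustrative):

# fill the data path until the watermarks trip
fallocate -l 300G /var/elasticsearch/storage/fill-test

# watch the stages get applied in the node logs and in _cat/allocation,
# then clean up
rm /var/elasticsearch/storage/fill-test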

All looks good then. Consider using an alerting system like AutoOps to make sure your cluster stays healthy and the disks don't fill up.
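Until proper alerting is in place, even a simple scripted check against the cat allocation API can act as a stopgap. Something like the following (the endpoint, credentials, and threshold are placeholders; adjust them for your ECK setup):

# warn when any node crosses 90% disk usage
curl -sk -u "elastic:$ELASTIC_PASSWORD" \
  "https://elastic-es-http:9200/_cat/allocation?h=node,disk.percent" \
  | awk '$2+0 > 90 { print "WARN: " $1 " at " $2 "% disk used" }'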