I have an Elasticsearch cluster (7.16.2) running on k8s, with multiple data nodes, dedicated client nodes, and dedicated master nodes.
Some of the data nodes are "hot" and some of them are "warm".
Daily indices are created, and all of them are managed by index templates and index lifecycle policies (hot -> warm -> delete).
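For illustration, the lifecycle is roughly this shape (the policy name, the ages, and the "data" node attribute below are placeholders, not my exact config):

curl -s -X PUT "http://localhost:9200/_ilm/policy/daily-logs" -H 'Content-Type: application/json' -d '
{
  "policy": {
    "phases": {
      "hot": { "actions": { "rollover": { "max_age": "1d" } } },
      "warm": { "min_age": "7d", "actions": { "allocate": { "require": { "data": "warm" } } } },
      "delete": { "min_age": "30d", "actions": { "delete": {} } }
    }
  }
}'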
My watermarks are left at the defaults:
cluster.routing.allocation.disk.watermark.low=85%
cluster.routing.allocation.disk.watermark.high=90%
cluster.routing.allocation.disk.watermark.flood_stage=95%
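These can be confirmed at runtime through the cluster settings API, something like this (localhost:9200 is a placeholder endpoint, e.g. reached via kubectl port-forward):

curl -s "http://localhost:9200/_cluster/settings?include_defaults=true&filter_path=*.cluster.routing.allocation.disk&pretty"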
Some of my indices have replica shards and some do not.
One of the "hot" data nodes (which is a pod) reached 100% disk usage and couldn't rejoin the cluster.
How can that happen?
Shouldn't the flood_stage parameter mark the indices with shards on that node as read-only and stop all writes to the disk?
Is there a setting I could somehow have missed or misconfigured?
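For context, the block I'd expect flood_stage to apply is index.blocks.read_only_allow_delete. Something like this should show it, and clear it manually if needed (same placeholder endpoint; in 7.x the block is also released automatically once usage drops back below the high watermark):

curl -s "http://localhost:9200/_all/_settings?filter_path=*.settings.index.blocks&pretty"
curl -s -X PUT "http://localhost:9200/_all/_settings" -H 'Content-Type: application/json' -d '{"index.blocks.read_only_allow_delete": null}'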
How much disk space do your nodes have? Because yes, it should stop writes, but there's always a bit of wiggle room due to things like segment merges or shard relocations that may still be in flight. So if your node only has 5GB free at the 95% mark and your indices are tens of GB in size, then that may explain it. But I am making a guess there without more info.
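Something like this would show the per-node disk picture as the cluster sees it (the columns come from the _cat/allocation API; the endpoint is a placeholder):

curl -s "http://localhost:9200/_cat/allocation?v&h=node,disk.indices,disk.used,disk.avail,disk.percent"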
Hi @warkolm
Thanks for the welcome, and for the quick response.
My "hot" nodes have 7TB of disk each (persisstent volume connected to the pod).
So 5% of that is 350GB..
Hi @Rios
Thanks for the response.
The filesystem that filled up is used only for Elasticsearch data, since it's a persistent volume mounted by the pod directly at the data directory.
The mount point is:
/usr/share/elasticsearch/data
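To confirm from inside the pod what's consuming the volume, something along these lines works (the pod name and namespace are placeholders):

kubectl exec -n elastic hot-data-node-0 -- df -h /usr/share/elasticsearch/data
kubectl exec -n elastic hot-data-node-0 -- du -sh /usr/share/elasticsearch/data/nodes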