I don't want to disable the disk threshold alerts/warnings, but I'm curious why Elasticsearch would think one of the disks is running low on space when the node with the largest utilization is still under 50% disk usage. Can something else trigger this?
[2022-07-18T01:46:46,076][WARN ][o.e.c.r.a.DiskThresholdMonitor] [hx90-3] flood stage disk watermark [95%] exceeded on [9dIOER48QDOB1NBks8CNeA][hx90-2][/var/lib/elasticsearch] free: 56.3gb[1.5%], all indices on this node will be marked read-only
[2022-07-18T01:47:46,085][WARN ][o.e.c.r.a.DiskThresholdMonitor] [hx90-3] flood stage disk watermark [95%] exceeded on [9dIOER48QDOB1NBks8CNeA][hx90-2][/var/lib/elasticsearch] free: 53.2gb[1.4%], all indices on this node will be marked read-only
What's really weird is that it appears to be complaining that two different nodes, each with plenty of space, are down to the same low amount of disk space. I'm going to check those two nodes to see if I can find anything additional. I've never encountered this issue before. (Well, I have, but only when disk space was actually getting low.)
I have more info. For the node hx90-2, it is set up to use multiple data paths. One of the data paths is indeed out of space. However, I thought with multiple data paths, it would combine all of them into one large pool? Please see below:
So Elasticsearch is right that one of the data paths is almost full, but my assumptions about how multiple data paths are used must be incorrect, because I thought they would be treated as one large pool. Internally, I guess it makes sense if the shard data is already on one path and that path runs out of space (but shouldn't it just move the shard over to a path with more space?).
If multiple data paths do not work like this, then /_cat/allocation might need to include more information for nodes with multiple data paths, such as a breakdown of each data path and its allocation. Right now it shows only the space left summed across all data paths, which means someone could look at the allocation output and think the cluster is fine while one data path is dangerously low on space.
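In case it helps anyone else who hits this: the node stats API (`GET _nodes/stats/fs`) reports an `fs.data` array with one entry per data path, so a per-path breakdown can be computed from that response. Below is a minimal sketch, not an official tool. The helper names, the `/data2` path, and all byte counts in `sample` are made up for illustration, and comparing `available_in_bytes` against the 95% flood stage is my assumption about how the check behaves, based on the log lines above.

```python
def per_path_usage(stats, flood_stage_pct=95.0):
    """Return one row per data path: (node, path, used_pct, over_flood_stage).

    `stats` is the parsed JSON body of GET _nodes/stats/fs.
    """
    rows = []
    for node in stats.get("nodes", {}).values():
        for entry in node.get("fs", {}).get("data", []):
            total = entry["total_in_bytes"]
            avail = entry["available_in_bytes"]
            used_pct = 100.0 * (total - avail) / total if total else 0.0
            rows.append(
                (node.get("name"), entry.get("path"),
                 round(used_pct, 1), used_pct >= flood_stage_pct)
            )
    return rows


def aggregate_used_pct(stats, node_name):
    """Used percentage summed over all data paths of one node (the pooled view)."""
    total = avail = 0
    for node in stats.get("nodes", {}).values():
        if node.get("name") != node_name:
            continue
        for entry in node.get("fs", {}).get("data", []):
            total += entry["total_in_bytes"]
            avail += entry["available_in_bytes"]
    return 100.0 * (total - avail) / total if total else 0.0


# Hypothetical response fragment: one nearly full path, one mostly empty path.
sample = {
    "nodes": {
        "9dIOER48QDOB1NBks8CNeA": {
            "name": "hx90-2",
            "fs": {
                "data": [
                    {"path": "/var/lib/elasticsearch",
                     "total_in_bytes": 1000, "available_in_bytes": 15},
                    {"path": "/data2",  # made-up second data path
                     "total_in_bytes": 3000, "available_in_bytes": 2700},
                ]
            },
        }
    }
}

if __name__ == "__main__":
    for row in per_path_usage(sample):
        print(row)
    print("aggregate used %:", aggregate_used_pct(sample, "hx90-2"))
```

With numbers like these, the pooled figure for the node looks healthy while a single path is already past the flood stage, which matches what I was seeing.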
Side note: Is using multiple data paths frowned upon or more of a sysadmin PITA? Should I be using LVM or some other RAID type structure for a node with multiple disks instead of using multiple data paths?
Yes. Multiple data paths don't work the way you expect, and the feature is deprecated. The recommended setup is a single data path per node, possibly running multiple nodes per host and/or combining volumes using something like RAID.