I have a two-node Elasticsearch cluster. I noticed I could no longer add data to it, and it turned out the filesystems on both machines had gone over the flood-stage disk watermark (which puts the indices into a read-only state).
To try to recover, I shut down the whole cluster and added another 100G to the master node (the first one). I also added a third node.
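On elk1 the extra 100G went in through LVM, roughly like this (the volume group and logical volume names match the df output below; /dev/sdc is a placeholder for the new disk, and the filesystem-grow step assumes XFS):

# /dev/sdc is a placeholder for the newly added 100G disk
pvcreate /dev/sdc
vgextend elasticdata /dev/sdc
# grow the elastic LV and the filesystem mounted at /elastic
lvextend -L +100G /dev/elasticdata/elastic
xfs_growfs /elastic    # use resize2fs instead if the filesystem is ext4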
Now the second and third nodes have filled up, while the first node has plenty of free space. The servers are called elk1, elk2, and elk3. Here is the df -h output:
elk1:
/dev/mapper/elasticdata-elastic 295G 178G 102G 64% /elastic
elk2:
/dev/sdb1 196G 180G 5.9G 97% /elastic
elk3:
/dev/sdb1 196G 179G 7.7G 96% /elastic
This message keeps repeating in elk1's log:
[2023-01-25T05:17:29,174][WARN ][o.e.c.r.a.DiskThresholdMonitor] [elk1] high disk watermark [96%] exceeded on [aG0V04CyRmuDsI9J2nkppQ][elk2][/elastic/nodes/0] free: 5.8gb[3%], shards will be relocated away from this node; currently relocating away shards totalling [0] bytes; the node is expected to continue to exceed the high disk watermark when these relocations are complete
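As a cross-check, the per-node disk usage as Elasticsearch itself sees it (shards, disk.used, disk.avail, disk.percent) can be listed with the cat allocation API (host is a placeholder):

curl -s 'http://localhost:9200/_cat/allocation?v'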
When I run the cluster allocation explain API:
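The request was roughly the following, targeting the unassigned replica (host is a placeholder):

curl -s 'http://localhost:9200/_cluster/allocation/explain' \
  -H 'Content-Type: application/json' -d'
{
  "index": "logdata",
  "shard": 0,
  "primary": false
}'

The response: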
{
  "index" : "logdata",
  "shard" : 0,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "ALLOCATION_FAILED",
    "at" : "2023-01-24T17:26:58.800Z",
    "failed_allocation_attempts" : 2,
    "details" : "failed shard on node [aG0V04CyRmuDsI9J2nkppQ]: shard failure, reason [merge failed], failure NotSerializableExceptionWrapper[merge_exception: java.io.IOException: No space left on device]; nested: IOException[No space left on device]; ",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions" : [
    {
      "node_id" : "DvpTvBh3RCyZvT0zrAfweQ",
      "node_name" : "elk3",
      "transport_address" : "165.88.130.21:9300",
      "node_attributes" : {
        "ml.machine_memory" : "8072323072",
        "ml.max_open_jobs" : "20",
        "xpack.installed" : "true",
        "transform.node" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "disk_threshold",
          "decision" : "NO",
          "explanation" : "the node is above the low watermark cluster setting [cluster.routing.allocation.disk.watermark.low=90%], using more disk space than the maximum allowed [90.0%], actual free: [3.8833546828500873%]"
        }
      ]
    },
    {
      "node_id" : "aG0V04CyRmuDsI9J2nkppQ",
      "node_name" : "elk2",
      "transport_address" : "165.88.130.31:9300",
      "node_attributes" : {
        "ml.machine_memory" : "8144543744",
        "ml.max_open_jobs" : "20",
        "xpack.installed" : "true",
        "transform.node" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "disk_threshold",
          "decision" : "NO",
          "explanation" : "the node is above the low watermark cluster setting [cluster.routing.allocation.disk.watermark.low=90%], using more disk space than the maximum allowed [90.0%], actual free: [3.0080296262767656%]"
        }
      ]
    },
    {
      "node_id" : "g-HQSgp6TCevlymvuzMRAA",
      "node_name" : "elk1",
      "transport_address" : "165.88.130.104:9300",
      "node_attributes" : {
        "ml.machine_memory" : "8139481088",
        "xpack.installed" : "true",
        "transform.node" : "true",
        "ml.max_open_jobs" : "20"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "same_shard",
          "decision" : "NO",
          "explanation" : "a copy of this shard is already allocated to this node [[logdata][0], node[g-HQSgp6TCevlymvuzMRAA], [P], s[STARTED], a[id=j4C-sV38QCO6PVowk-Y0iw]]"
        }
      ]
    }
  ]
}
Index status:
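This was pulled with the cat indices API (host is a placeholder):

curl -s 'http://localhost:9200/_cat/indices/logdata?v'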
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
yellow open logdata tJ9ZjFluTUWbLGGkAOiZqA 1 1 665303762 57066721 177.5gb 177.5gb
I'm not sure how to recover from this. I've been deleting documents that are no longer needed, but I can't delete the entire index. As I read the explain output, the replica of logdata has nowhere to go: elk2 and elk3 are both over the low watermark, and elk1 is ruled out because it already holds the primary.
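The deletes look something like this (the @timestamp field and the 90-day cutoff are placeholders for my actual criteria):

# placeholder query - the field name and cutoff are examples only
curl -s -X POST 'http://localhost:9200/logdata/_delete_by_query' \
  -H 'Content-Type: application/json' -d'
{
  "query": {
    "range": { "@timestamp": { "lt": "now-90d" } }
  }
}'

My understanding is that deleted documents (docs.deleted is already at ~57 million above) only give disk space back once their segments are merged away, and the explain output shows a merge already failing with "No space left on device", so I suspect these deletes aren't actually reclaiming anything. Is there a way out of this short of deleting the whole index?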