Elasticsearch version 7.17.12
Last night my cluster stopped ingesting data. One node ran out of disk space after a snapshot started. That node normally has plenty of headroom:
used: 57%
available: 1.85TB
total: 4.30TB
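As a sanity check on those numbers, here is a small Python sketch. It assumes the 57% figure is disk used (which matches 1.85TB available out of 4.30TB), and it takes the 85% low watermark and the 10% free (90% used) threshold from the log lines in this post. The function names are only for illustration.

```python
def used_percent(available_tb: float, total_tb: float) -> float:
    """Percent of the disk in use, given available and total space."""
    return 100.0 * (total_tb - available_tb) / total_tb


def headroom_tb(available_tb: float, total_tb: float, watermark_used_pct: float) -> float:
    """How much more data fits before disk usage reaches the given watermark."""
    used_tb = total_tb - available_tb
    return total_tb * watermark_used_pct / 100.0 - used_tb


if __name__ == "__main__":
    available, total = 1.85, 4.30  # values reported for the node on a normal day
    print(f"used: {used_percent(available, total):.0f}%")
    # 85% is the low watermark from the DiskThresholdMonitor log line
    print(f"headroom to 85% used: {headroom_tb(available, total, 85):.2f}TB")
    # 90% used corresponds to the "10% free disk threshold" in the decider log lines
    print(f"headroom to 90% used: {headroom_tb(available, total, 90):.2f}TB")
```

So on a normal day the node should be able to take more than a terabyte of extra data before reaching either threshold.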
logs show:
[2023-09-12T00:30:00,182][INFO ][o.e.s.SnapshotsService ] [secesprd01] snapshot [new-daily:daily-2023.09.11-w21i9jn9qsifim7l65vc0a/zHxRzk2QTPivX6-y08d0mg] started
[2023-09-12T00:30:39,725][INFO ][o.e.c.r.a.DiskThresholdMonitor] [secesprd01] low disk watermark [85%] no longer exceeded on [DsJqLibJQSi9D2lIAUHOrw][secesprd09][/data/elasticsearch/security/nodes/0] free: 534gb[19%]
[2023-09-12T00:30:39,737][WARN ][o.e.c.r.a.d.DiskThresholdDecider] [secesprd01] after allocating [[arkime_sessions3-230905][0], node[6UDagJW2T3eWM-0PQJ0rMA], [P], s[STARTED], a[id=wO2cjVlVQvK-HZoTFtMTtw]] node [DsJqLibJQSi9D2lIAUHOrw] would have more than the allowed 10% free disk threshold (5.3% free), preventing allocation
[2023-09-12T00:30:39,737][WARN ][o.e.c.r.a.d.DiskThresholdDecider] [secesprd01] after allocating [[arkime_sessions3-230911][1], node[6UDagJW2T3eWM-0PQJ0rMA], [P], s[STARTED], a[id=lYe1STNvQpmDkYTQ_UZSDg]] node [DsJqLibJQSi9D2lIAUHOrw] would have more than the allowed 10% free disk threshold (3.8% free), preventing allocation
.......
[2023-09-12T00:56:10,201][WARN ][o.e.c.r.a.d.DiskThresholdDecider] [secesprd01] after allocating [[arkime_sessions3-230910][1], node[kAWPcpoxSNSN9WlUsYlQlg], [P], s[STARTED], a[id=tzbQK9OFS7OBr2csxLeC2g]] node [DsJqLibJQSi9D2lIAUHOrw] would have less than the required threshold of 0b free (currently 422.1gb free, estimated shard size is 789.2gb), preventing allocation
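The last log line above comes down to simple arithmetic: the node has 422.1gb free and the shard is estimated at 789.2gb, so it cannot fit. A rough Python sketch of that check, using only the numbers from the log (the function names are mine, not Elasticsearch's, and the real decider takes more inputs than this):

```python
def free_after_allocation_gb(free_gb: float, shard_gb: float) -> float:
    """Free space left on the node if the shard were allocated to it."""
    return free_gb - shard_gb


def fits(free_gb: float, shard_gb: float, required_free_gb: float = 0.0) -> bool:
    """True if the node would still have at least required_free_gb free afterwards."""
    return free_after_allocation_gb(free_gb, shard_gb) >= required_free_gb


if __name__ == "__main__":
    free_gb, shard_gb = 422.1, 789.2  # from the 00:56:10 log line
    remaining = free_after_allocation_gb(free_gb, shard_gb)
    print(f"free after allocation: {remaining:.1f}gb")
    print(f"fits with 0b required free: {fits(free_gb, shard_gb)}")
```

The result is negative, which matches the "less than the required threshold of 0b free" wording in the log.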
After that there were no more allocation errors, and the snapshot finished hours after the disk problem went away. So it seems unlikely that the problem is related to the snapshot.
I have moved the mount point of the backup dir out of the 'data path' as a precaution. I did check whether the backup mount had failed (as it occasionally does), but it looked good.
The data path has a partition to itself. Nothing else should be writing into it.
Any ideas what happened?