Disk usage/shard allocation problems during snapshot creation

version 7.17.12

Last night my cluster stopped ingesting data. One node ran out of disk after a snapshot started. That node normally has plenty of headroom:

used: 57%
available: 1.85TB
total: 4.30TB
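
For reference, per-node headroom figures like these can be pulled from the cat allocation API (the column selection here is just an example):

GET _cat/allocation?v&h=node,disk.percent,disk.used,disk.avail,disk.total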

logs show:

[2023-09-12T00:30:00,182][INFO ][o.e.s.SnapshotsService   ] [secesprd01] snapshot [new-daily:daily-2023.09.11-w21i9jn9qsifim7l65vc0a/zHxRzk2QTPivX6-y08d0mg] started
[2023-09-12T00:30:39,725][INFO ][o.e.c.r.a.DiskThresholdMonitor] [secesprd01] low disk watermark [85%] no longer exceeded on [DsJqLibJQSi9D2lIAUHOrw][secesprd09][/data/elasticsearch/security/nodes/0] free: 534gb[19%]
[2023-09-12T00:30:39,737][WARN ][o.e.c.r.a.d.DiskThresholdDecider] [secesprd01] after allocating [[arkime_sessions3-230905][0], node[6UDagJW2T3eWM-0PQJ0rMA], [P], s[STARTED], a[id=wO2cjVlVQvK-HZoTFtMTtw]] node [DsJqLibJQSi9D2lIAUHOrw] would have more than the allowed 10% free disk threshold (5.3% free), preventing allocation
[2023-09-12T00:30:39,737][WARN ][o.e.c.r.a.d.DiskThresholdDecider] [secesprd01] after allocating [[arkime_sessions3-230911][1], node[6UDagJW2T3eWM-0PQJ0rMA], [P], s[STARTED], a[id=lYe1STNvQpmDkYTQ_UZSDg]] node [DsJqLibJQSi9D2lIAUHOrw] would have more than the allowed 10% free disk threshold (3.8% free), preventing allocation
.......
[2023-09-12T00:56:10,201][WARN ][o.e.c.r.a.d.DiskThresholdDecider] [secesprd01] after allocating [[arkime_sessions3-230910][1], node[kAWPcpoxSNSN9WlUsYlQlg], [P], s[STARTED], a[id=tzbQK9OFS7OBr2csxLeC2g]] node [DsJqLibJQSi9D2lIAUHOrw] would have less than the required threshold of 0b free (currently 422.1gb free, estimated shard size is 789.2gb), preventing allocation

Then there were no more allocation errors, and the snapshot finished hours after the disk problem went away, so it seems unlikely that the problem is related to the snapshot.
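
As an aside, the thresholds in those log lines are the standard disk watermarks (defaults: low 85%, high 90%, flood-stage 95%), and the "allowed 10% free" figure lines up with the 90% high watermark. The effective values can be checked with something like:

GET _cluster/settings?include_defaults=true&filter_path=*.cluster.routing.allocation.disk*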

I have moved the mount point of the backup dir out of the data path as a precaution. I did check whether the backup mount had failed (as it occasionally does), but it looked fine.

The data path has a partition to itself. Nothing else should be writing to it.

Any ideas what happened?

I now think I know what happened. We have a series of indexes that roll over every day, and we were keeping 7 days of data. These indexes store network flow data and were taking up about 700-800 GB/day. Over the weekend there were some changes in the network setup, and the indexes are now tracking at just over 1TB per day.

At midnight the index rolled over before the old index was deleted. The (unexpected) 30% increase in the size of those indexes pushed us over the edge last night. As soon as the old index was deleted, everything was fine.

The lifecycle policy has been changed and one old index has been deleted, so hopefully all will be well tonight!
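
For anyone curious, a minimal sketch of a daily-rollover-plus-delete ILM policy of this kind (the policy name, ages and sizes below are illustrative, not our actual values):

# roll the write index daily (or at ~50gb per primary shard), delete indices 6 days after rollover
PUT _ilm/policy/flow-data-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_age": "1d", "max_primary_shard_size": "50gb" }
        }
      },
      "delete": {
        "min_age": "6d",
        "actions": { "delete": {} }
      }
    }
  }
}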

Is that estimate accurate? That's pretty big for a single shard. In particular, the allocator kind of assumes/requires that there's room for a shard in the gap above each watermark (i.e. between low and high, between high and flood-stage, and between flood-stage and disk-full). Maybe it'd help to have more/smaller shards here.

Thanks David (as always : )

Yes, the shard is large. I was working on the assumption of one shard per eligible node; I have 3 hot nodes, hence 3 shards. Should I double that? I did notice the shard size recommendation when I was researching the problem. One of those nodes is new and has less disk, and hence less free headroom. I will increase the disk allocation on that machine. I was just looking at the amount of free space on the other hot nodes.
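
If I do double it, that would just be a template change along these lines (the template name is a placeholder, the pattern is a guess based on the index names above, and the shard count is illustrative):

# hypothetical composable template bumping primaries from 3 to 6
PUT _index_template/flow-template
{
  "index_patterns": ["arkime_sessions3-*"],
  "template": {
    "settings": { "index.number_of_shards": 6 }
  }
}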

I have also turned off force_merge on warm, as that (I assume) requires the shard to be duplicated.
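
(A force merge does write the new, merged segments before the old ones are deleted, so it can temporarily need extra space on that node, up to roughly the size of the data being merged. In ILM terms the change is just removing the forcemerge action from the warm phase; the current warm-phase actions can be reviewed with:

GET _ilm/policy
)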

Multiple shards per node is normal. Generally we try to avoid the primary shards for a given index being located on the same node, for several reasons, one being ingest load distribution (high-scale use cases). Keeping shard sizes at around 50GB is best practice.
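
A quick way to compare shard sizes against that ~50GB guideline is the cat shards API (the index pattern here is just an example):

GET _cat/shards/arkime_sessions3-*?v&h=index,shard,prirep,node,store&s=store:desc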
