Balancing disk usage on large clusters?

We have a relatively large cluster, and one issue we see from time to time is uneven disk usage, because ElasticSearch balances by shard count rather than by shard resource consumption. All nodes end up with similar shard counts, as expected, but a few nodes may have been favored for small shards or 0-doc indexes. While I can address the 0-doc indexes relatively easily, the small indexes/shards are somewhat purposeful, in that ILM will age that data out according to our expected retention. (So I do not want to just try to make all shards equal in size.)

Does anyone have some easy-to-consume resources on balancing more efficiently by disk usage as well?

Yep, it balances by shard count. I have seen people change balancing settings, but it's not something that we recommend.

Are you crossing any watermark levels with things as they are? What sort of differences are you seeing between the nodes? (Would it be possible to share the output of _cat/nodes?v&h=id,v,rp,dt,du,dup?)

(PS it's Elasticsearch, no S :slight_smile: )

id   v      rp     dt    du   dup
Goe5 7.6.2  96  6.9tb 6.7tb 95.96
uclB 7.6.2  87  6.9tb 5.8tb 84.45
KxA4 7.6.2  99  6.9tb 6.5tb 94.26
VAwV 7.6.2  97  6.9tb   6tb 86.19
oZIl 7.6.2  97  6.9tb 6.3tb 90.38
-k0_ 7.6.2  99  6.9tb 6.6tb 95.68
_Asj 7.6.2  98  6.9tb 6.6tb 95.77
_fn_ 7.6.2  98  6.9tb 6.3tb 91.50
EiqT 7.6.2  89 17.4gb 5.8gb 33.73
l9ce 7.6.2  98 17.4gb 4.9gb 28.09
DQp6 7.6.2  98  6.9tb 6.5tb 93.79
s93T 7.6.2  98  7.2tb 6.4tb 88.89
QYoq 7.6.2  75 17.4gb 4.4gb 25.23
3rx_ 7.6.2  98  6.9tb 6.2tb 89.02
7iqI 7.6.2  79 17.4gb 5.1gb 29.53
xOAX 7.6.2  96  6.9tb 6.3tb 90.77
21pb 7.6.2 100  7.2tb 6.7tb 92.27
3xj1 7.6.2  97  6.9tb 6.2tb 89.28
_NE3 7.6.2  94  6.9tb 6.6tb 95.08
55ca 7.6.2  99  6.9tb 6.5tb 94.50
AArZ 7.6.2  95  6.9tb 6.6tb 94.91
dsL3 7.6.2  98  7.2tb 5.3tb 73.87
3Afq 7.6.2  88 17.4gb 4.9gb 28.12
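
For anyone wanting to answer the watermark question from output like the above, here is a rough sketch in Python. It assumes the cluster still uses the default disk watermarks (85% low, 90% high, 95% flood stage); if those settings were changed, the thresholds need adjusting. The function names and the shortened SAMPLE are made up for illustration, and only a few of the posted rows are included.

```python
# Default disk watermark thresholds in percent.
# Assumption: the cluster has not overridden these defaults.
LOW, HIGH, FLOOD = 85.0, 90.0, 95.0

# A few rows copied from the posted _cat/nodes output (header included).
SAMPLE = """id   v      rp     dt    du   dup
Goe5 7.6.2  96  6.9tb 6.7tb 95.96
uclB 7.6.2  87  6.9tb 5.8tb 84.45
oZIl 7.6.2  97  6.9tb 6.3tb 90.38
dsL3 7.6.2  98  7.2tb 5.3tb 73.87
"""


def parse_cat_nodes(text):
    """Turn whitespace-separated _cat output (with a header row) into dicts."""
    lines = [line.split() for line in text.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    return [dict(zip(header, row)) for row in rows]


def classify(used_percent):
    """Name the highest watermark that a disk usage percentage has reached."""
    if used_percent >= FLOOD:
        return "flood_stage"
    if used_percent >= HIGH:
        return "high"
    if used_percent >= LOW:
        return "low"
    return "ok"


def watermark_report(text):
    """Map each node id to the watermark level implied by its dup column."""
    return {node["id"]: classify(float(node["dup"])) for node in parse_cat_nodes(text)}


if __name__ == "__main__":
    for node_id, level in watermark_report(SAMPLE).items():
        print(node_id, level)
```

Pasting the full table into SAMPLE gives a per-node view of which nodes are past the point where Elasticsearch stops allocating new shards to them (low), starts moving shards away (high), or blocks writes to affected indices (flood stage).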

The issue is that, in aggregate, this cluster has enough space for the daily load of logging and metrics, but for whatever reason the disk usage is not spread uniformly. Shards are balanced across the cluster at this point in time, and barring the master/ml nodes (there are 3 masters and 2 ml nodes), we can see there are data nodes with as little as around 300GiB of free space and data nodes with as much as 1.8TiB.
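
To put a number on the spread, free space per data node can be estimated from the dt and du columns. This is a sketch only. It assumes the `_cat` size suffixes are binary units (tb = 1024^4 bytes) and that any node with a total disk under a hypothetical 1tb cutoff is a master/ml node to be skipped. Since the `_cat` output is rounded to one decimal place, the results are approximate. The names and the shortened SAMPLE are illustrative.

```python
# Byte multipliers for the size suffixes seen in _cat output.
# Assumption: these are binary units, as Elasticsearch byte sizes normally are.
UNITS = {"b": 1, "kb": 1024, "mb": 1024**2, "gb": 1024**3, "tb": 1024**4}
GIB = 1024**3

# A few rows copied from the posted _cat/nodes output (header included).
SAMPLE = """id   v      rp     dt    du   dup
Goe5 7.6.2  96  6.9tb 6.7tb 95.96
EiqT 7.6.2  89 17.4gb 5.8gb 33.73
dsL3 7.6.2  98  7.2tb 5.3tb 73.87
"""


def to_bytes(size):
    """Convert a value such as '6.9tb' or '17.4gb' to a number of bytes."""
    # Check two-letter suffixes before the bare 'b' so 'tb' is not read as 'b'.
    for suffix in ("kb", "mb", "gb", "tb", "b"):
        if size.endswith(suffix):
            return float(size[: -len(suffix)]) * UNITS[suffix]
    raise ValueError(f"unrecognised size: {size!r}")


def free_space_gib(text, min_total="1tb"):
    """Return {node id: approximate free GiB} for nodes with dt >= min_total."""
    lines = [line.split() for line in text.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    cutoff = to_bytes(min_total)
    result = {}
    for row in rows:
        node = dict(zip(header, row))
        total, used = to_bytes(node["dt"]), to_bytes(node["du"])
        if total >= cutoff:  # skip the small-disk master/ml nodes
            result[node["id"]] = (total - used) / GIB
    return result


if __name__ == "__main__":
    free = free_space_gib(SAMPLE)
    for node_id, gib in sorted(free.items(), key=lambda item: item[1]):
        print(f"{node_id}: ~{gib:.0f} GiB free")
    print(f"spread: ~{max(free.values()) - min(free.values()):.0f} GiB")
```

Running this against the full table shows the gap between the fullest and emptiest data nodes at a glance, which is easier to track over time than eyeballing the dup column.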
