Elastic shard balancing / allocation

Hi
We are on Elastic 8.6 with 38 hot data nodes, ingesting into about 140 different indices.
The top 10 indices have an indexing rate of about 15-50K events/sec, 20 indices have 1-20K events/sec, and the remaining 100 indices index at a low rate of 0-1,000 events/sec.
All nodes have 8 vCPUs, 32 GB RAM and a 2 TB SSD.
As the table below shows, the cluster fills some nodes almost to capacity (up to 92% disk used) while others stay far emptier (as low as 21%), even though all nodes have the same capacity and performance.

Could you suggest some ideas on how to fix this?
It affects indexing latency and overall cluster performance.

Cluster balancing parameters are at their defaults. I was thinking of changing cluster.routing.allocation.balance.disk_usage from its default of 2e-11f to a larger value such as 2e-9f, but I have no experience with how this value affects rebalancing behavior.
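One rough way to judge the size of that change is to ask how many extra shards on a node weigh as much as a given amount of extra disk. The sketch below assumes a simplified model of the balancer in which each factor multiplies a node's deviation from the cluster average: shard count for cluster.routing.allocation.balance.shard (default 0.45) and bytes for disk_usage (default 2e-11). The helper names are hypothetical. The real weight function also has index and write_load terms and, as far as I understand, normalizes the factors by their sum, so treat the results as orders of magnitude only.

```python
# ASSUMPTION: simplified model of the balancer's node weight, where each
# factor multiplies how far a node is from the cluster average in that
# dimension. Helper names are hypothetical, not Elasticsearch APIs.

GB = 10**9  # decimal gigabytes, to keep the numbers round

SHARD_FACTOR = 0.45          # default cluster.routing.allocation.balance.shard
DEFAULT_DISK_FACTOR = 2e-11  # default cluster.routing.allocation.balance.disk_usage


def shard_term(extra_shards: float, factor: float = SHARD_FACTOR) -> float:
    """Weight added by holding `extra_shards` more shards than average."""
    return factor * extra_shards


def disk_term(extra_bytes: float, factor: float = DEFAULT_DISK_FACTOR) -> float:
    """Weight added by holding `extra_bytes` more data than average."""
    return factor * extra_bytes


def shards_equivalent(extra_bytes: float, disk_factor: float = DEFAULT_DISK_FACTOR) -> float:
    """Number of extra shards that weigh as much as `extra_bytes` of extra data."""
    return disk_term(extra_bytes, disk_factor) / shard_term(1)


if __name__ == "__main__":
    for factor in (2e-11, 2e-10, 2e-9):
        eq = shards_equivalent(500 * GB, factor)
        print(f"disk_usage={factor:g}: 500 GB above average ~ {eq:,.0f} extra shards")
```

Under this model, the default 2e-11 makes a node holding 500 GB more than average count about as much as roughly 22 extra shards, while the table shows shard-count gaps of well over 100 between nodes (31 vs. 236). At 2e-9 the same 500 GB would count like about 2,200 extra shards and would likely dominate every other term, so an intermediate value such as 2e-10 may be a gentler first experiment.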

node         shards  disk.indices  disk.total  disk.used  disk.percent
tela12_node  31      420.8gb       1.9tb       427.3gb    21
tela36_node  108     726.6gb       1.9tb       734.7gb    36
tela21_node  75      728.6gb       1.9tb       734.6gb    37
tela34_node  155     929.2gb       1.9tb       934.9gb    46
tela13_node  209     908.2gb       1.9tb       914.8gb    46
tela37_node  142     951.5gb       1.9tb       957.7gb    47
tela33_node  114     1000gb        1.9tb       1008.1gb   49
tela10_node  224     1009.8gb      1.9tb       1016.3gb   51
tela28_node  217     1tb           1.9tb       1tb        52
tela22_node  221     1tb           1.9tb       1tb        52
tela18_node  199     1tb           1.9tb       1tb        54
tela35_node  142     1.1tb         1.9tb       1.1tb      55
tela04_node  195     1tb           1.9tb       1.1tb      57
tela17_node  216     1.1tb         1.9tb       1.1tb      58
tela14_node  214     1.1tb         1.9tb       1.1tb      58
tela07_node  215     1tb           1.9tb       1.1tb      60
tela19_node  236     1.2tb         1.9tb       1.2tb      66
tela24_node  219     1.3tb         1.9tb       1.3tb      67
tela26_node  219     1.3tb         1.9tb       1.3tb      69
tela02_node  222     1.2tb         1.9tb       1.3tb      71
tela01_node  232     1.3tb         1.9tb       1.4tb      73
tela16_node  187     1.4tb         1.9tb       1.4tb      74
tela06_node  203     1.3tb         1.9tb       1.4tb      74
tela05_node  179     1.3tb         1.9tb       1.4tb      76
tela23_node  206     1.5tb         1.9tb       1.5tb      81
tela32_node  153     1.6tb         1.9tb       1.6tb      83
tela03_node  213     1.5tb         1.9tb       1.6tb      85
tela30_node  123     1.7tb         1.9tb       1.7tb      86
tela11_node  176     1.6tb         1.9tb       1.7tb      88
tela27_node  177     1.7tb         1.9tb       1.7tb      89
tela15_node  149     1.7tb         1.9tb       1.7tb      90
tela29_node  130     1.7tb         1.9tb       1.7tb      90
tela08_node  120     1.6tb         1.9tb       1.7tb      90
tela38_node  127     1.8tb         1.9tb       1.8tb      91
tela20_node  94      1.7tb         1.9tb       1.7tb      91
tela09_node  126     1.6tb         1.9tb       1.7tb      91
tela31_node  159     1.7tb         1.9tb       1.8tb      91
tela25_node  172     1.7tb         1.9tb       1.7tb      92
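To check whether a settings change actually narrows this spread, it helps to measure it the same way each time. The sketch below is a hypothetical helper (not part of any Elastic tooling) that parses rows in the column order of the table above and flags nodes against the default disk watermarks (low 85%, high 90%), assuming the cluster has not overridden them. A request along the lines of GET _cat/allocation?v&h=node,shards,disk.indices,disk.total,disk.used,disk.percent&s=disk.percent should produce that column order.

```python
from dataclasses import dataclass
from statistics import mean

# ASSUMPTION: default disk watermarks (percent used); adjust if overridden.
LOW_WATERMARK = 85
HIGH_WATERMARK = 90


@dataclass
class NodeAlloc:
    node: str
    shards: int
    disk_percent: int


def parse_allocation_table(text: str) -> list[NodeAlloc]:
    """Parse rows laid out as: node shards disk.indices disk.total disk.used disk.percent."""
    rows = []
    for line in text.strip().splitlines():
        parts = line.split()
        if not parts or parts[0] == "node":  # skip blank lines and the header
            continue
        rows.append(NodeAlloc(parts[0], int(parts[1]), int(parts[-1])))
    return rows


def summarize(rows: list[NodeAlloc]) -> dict:
    """Disk spread plus nodes at or above the watermarks.

    disk.percent is rounded in the cat output, so the flags are approximate.
    """
    pcts = [r.disk_percent for r in rows]
    return {
        "min_pct": min(pcts),
        "max_pct": max(pcts),
        "spread_pct": max(pcts) - min(pcts),
        "mean_pct": round(mean(pcts), 1),
        "at_or_above_low": sorted(r.node for r in rows if r.disk_percent >= LOW_WATERMARK),
        "at_or_above_high": sorted(r.node for r in rows if r.disk_percent >= HIGH_WATERMARK),
    }
```

Applied to the full table above, summarize(parse_allocation_table(text)) reports a 71-point spread (21% to 92%), a mean of about 68.1%, 12 nodes at or above the default low watermark and 8 at or above the default high watermark.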

I set up the index templates roughly as follows: high-throughput indices get more primary shards plus a shards-per-node limit, so that their indexing load is spread across the CPUs of more nodes.

Indexing rate (K events/sec)  Primary shards  Shards per node
30+                           15              1
20-30                         15              1
15-20                         10              1
10-15                         8               1
5-10                          6               any
1-5                           3               any
0-1                           1               any
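Assuming the "shards per node" column means the index setting index.routing.allocation.total_shards_per_node, and that each index has one replica (the post does not say), the hard limit of 1 pins down how many of the 38 nodes each high-rate index must occupy. The function names below are hypothetical; this is a sketch of the arithmetic only.

```python
import math

CLUSTER_NODES = 38  # hot data nodes, from the post

# Templates from the table that use a per-node limit.
# ASSUMPTION: "shards per node" = index.routing.allocation.total_shards_per_node.
LIMITED_TEMPLATES = [
    # (indexing rate band in K events/sec, primary shards, shards per node)
    ("30+", 15, 1),
    ("20-30", 15, 1),
    ("15-20", 10, 1),
    ("10-15", 8, 1),
]


def min_nodes_required(primaries: int, replicas: int, shards_per_node: int) -> int:
    """Fewest nodes that can hold every copy (primaries + replicas) of one index
    when no node may hold more than `shards_per_node` of its shards."""
    total_copies = primaries * (1 + replicas)
    return math.ceil(total_copies / shards_per_node)


def slack_nodes(primaries: int, replicas: int, shards_per_node: int,
                cluster_nodes: int = CLUSTER_NODES) -> int:
    """Nodes left over once the index is spread as thin as the limit forces
    (negative means the index cannot be fully allocated)."""
    return cluster_nodes - min_nodes_required(primaries, replicas, shards_per_node)


if __name__ == "__main__":
    REPLICAS = 1  # ASSUMPTION: replica count is not stated in the post
    for band, primaries, per_node in LIMITED_TEMPLATES:
        need = min_nodes_required(primaries, REPLICAS, per_node)
        spare = slack_nodes(primaries, REPLICAS, per_node)
        print(f"{band:>6} kEPS: needs {need} nodes, {spare} spare")
```

Under these assumptions a 15-primary index must occupy 30 of the 38 nodes, leaving only 8 nodes of slack, and a 10-primary index needs 20. With up to 10 high-rate indices all under a hard one-shard-per-node limit, the balancer has limited freedom in where those shards can go, which may be one reason it cannot even out disk usage.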

This is my fourth post on this topic. The new version has brought some improvements, but the issue persists.
