Hi
We are on Elasticsearch 8.6 with 38 hot data nodes, ingesting into about 140 different indices.
The top 10 indices have indexing rates of roughly 15-50K events/sec, about 20 indices run at 1-20K events/sec, and the remaining ~100 indices index at low rates of 0-1000 events/sec.
All nodes have 8 vCPUs, 32 GB RAM, and a 2 TB SSD.
As you can see in the table below, the cluster overfills some nodes while others are almost empty, even though all nodes have the same capacity and performance. This hurts indexing latency and overall cluster performance. Could you give me some ideas on how to fix this?

Cluster balancing parameters are at their defaults. I was thinking of raising cluster.routing.allocation.balance.disk_usage from its default of 2e-11f to something like 2e-9f, but I have no experience with how this value affects the rebalancing behavior.
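If I understand the cluster settings API correctly, the change would look roughly like this (the 2e-9 value is just my guess, not a recommendation I found anywhere, and I dropped the `f` suffix, which I believe is only Java float notation):

```
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.balance.disk_usage": 2e-9
  }
}
```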
node | shards | disk.indices | disk.total | disk.used | disk.percent |
---|---|---|---|---|---|
tela12_node | 31 | 420.8gb | 1.9tb | 427.3gb | 21 |
tela36_node | 108 | 726.6gb | 1.9tb | 734.7gb | 36 |
tela21_node | 75 | 728.6gb | 1.9tb | 734.6gb | 37 |
tela34_node | 155 | 929.2gb | 1.9tb | 934.9gb | 46 |
tela13_node | 209 | 908.2gb | 1.9tb | 914.8gb | 46 |
tela37_node | 142 | 951.5gb | 1.9tb | 957.7gb | 47 |
tela33_node | 114 | 1000gb | 1.9tb | 1008.1gb | 49 |
tela10_node | 224 | 1009.8gb | 1.9tb | 1016.3gb | 51 |
tela28_node | 217 | 1tb | 1.9tb | 1tb | 52 |
tela22_node | 221 | 1tb | 1.9tb | 1tb | 52 |
tela18_node | 199 | 1tb | 1.9tb | 1tb | 54 |
tela35_node | 142 | 1.1tb | 1.9tb | 1.1tb | 55 |
tela04_node | 195 | 1tb | 1.9tb | 1.1tb | 57 |
tela17_node | 216 | 1.1tb | 1.9tb | 1.1tb | 58 |
tela14_node | 214 | 1.1tb | 1.9tb | 1.1tb | 58 |
tela07_node | 215 | 1tb | 1.9tb | 1.1tb | 60 |
tela19_node | 236 | 1.2tb | 1.9tb | 1.2tb | 66 |
tela24_node | 219 | 1.3tb | 1.9tb | 1.3tb | 67 |
tela26_node | 219 | 1.3tb | 1.9tb | 1.3tb | 69 |
tela02_node | 222 | 1.2tb | 1.9tb | 1.3tb | 71 |
tela01_node | 232 | 1.3tb | 1.9tb | 1.4tb | 73 |
tela16_node | 187 | 1.4tb | 1.9tb | 1.4tb | 74 |
tela06_node | 203 | 1.3tb | 1.9tb | 1.4tb | 74 |
tela05_node | 179 | 1.3tb | 1.9tb | 1.4tb | 76 |
tela23_node | 206 | 1.5tb | 1.9tb | 1.5tb | 81 |
tela32_node | 153 | 1.6tb | 1.9tb | 1.6tb | 83 |
tela03_node | 213 | 1.5tb | 1.9tb | 1.6tb | 85 |
tela30_node | 123 | 1.7tb | 1.9tb | 1.7tb | 86 |
tela11_node | 176 | 1.6tb | 1.9tb | 1.7tb | 88 |
tela27_node | 177 | 1.7tb | 1.9tb | 1.7tb | 89 |
tela15_node | 149 | 1.7tb | 1.9tb | 1.7tb | 90 |
tela29_node | 130 | 1.7tb | 1.9tb | 1.7tb | 90 |
tela08_node | 120 | 1.6tb | 1.9tb | 1.7tb | 90 |
tela38_node | 127 | 1.8tb | 1.9tb | 1.8tb | 91 |
tela20_node | 94 | 1.7tb | 1.9tb | 1.7tb | 91 |
tela09_node | 126 | 1.6tb | 1.9tb | 1.7tb | 91 |
tela31_node | 159 | 1.7tb | 1.9tb | 1.8tb | 91 |
tela25_node | 172 | 1.7tb | 1.9tb | 1.7tb | 92 |
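The table above comes from the cat allocation API, approximately this query:

```
GET _cat/allocation?v&h=node,shards,disk.indices,disk.total,disk.used,disk.percent&s=disk.percent
```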
I set up the index templates approximately as follows: high-throughput indices get more primary shards plus a shards-per-node limit, so the indexing load is spread across more nodes' CPUs (a sketch of such a template follows after the table).
Indexing rate (K events/sec) | Number of primary shards | Shards per node |
---|---|---|
30+ | 15 | 1 |
20-30 | 15 | 1 |
15-20 | 10 | 1 |
10-15 | 8 | 1 |
5-10 | 6 | any |
1-5 | 3 | any |
0-1 | 1 | any |
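As an example, the template for a 30+ K events/sec index looks roughly like this (template name and index pattern here are placeholders, and replicas are left at the default):

```
PUT _index_template/hi-perf-logs
{
  "index_patterns": ["hi-perf-logs-*"],
  "template": {
    "settings": {
      "index.number_of_shards": 15,
      "index.routing.allocation.total_shards_per_node": 1
    }
  }
}
```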
This is my fourth post on this topic; the new version has brought some improvements, but this issue is still ongoing.