Hi,
we're currently managing an ES cluster with a hot/cold architecture: 25 cold nodes with 12 TB of disk each, and 42 hot nodes with 2 TB each.
For quite a while, a few of our cold nodes have been running very low on space, below the low disk watermark, while others have plenty of free space (99% disk usage versus 81%). We assume this is because the allocator tries to balance the number of shards on each node, and our cluster's data is quite heterogeneous: we have a lot of ~50 GB shards but also quite a few shards of 1 GB or even a few MB.
According to our stats, most hot nodes hold around 120 shards, while the cold nodes are in the range of 659. The nodes that are low on space mostly hold around 5XX shards, so we assume the balancing algorithm keeps sending shards to nodes that are already almost full, because those nodes contain larger-than-average shards and therefore fewer of them.
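For reference, this is roughly how we collect those per-node numbers (shard count and disk usage), via the _cat allocation API; the column list and sort are just what we happen to find useful:

GET _cat/allocation?v&h=node,shards,disk.indices,disk.used,disk.avail,disk.percent&s=disk.percent:desc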
We're wondering if there's a way to balance our cluster better by taking disk space into account, or any other policy we could implement; we assume ES 8 addresses this with cluster.routing.allocation.balance.disk_usage.
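If we understand it correctly, the idea would be something along these lines (just a sketch, assuming an ES 8 version that ships the desired-balance allocator and this setting; the value is purely illustrative, we don't know what a sensible weight would be):

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.balance.disk_usage": "2e-11"
  }
}

As we understand it, a higher weight should make the balancer factor each shard's (forecasted) disk usage into the desired balance, rather than just shard counts, but please correct us if that's not how it works.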
This is our cluster config (the parts we think are relevant to the case):
"node_concurrent_recoveries" : "32",
"disk" : {
"watermark" : {
"low" : "100gb",
"flood_stage" : "10gb",
"high" : "50gb"
}
"transient" : {
"cluster" : {
"routing" : {
"rebalance" : {
"enable" : "all"
},
"allocation" : {
"balance" : {
"index" : "5.0f",
"threshold" : "5.0f"
},
"cluster_concurrent_rebalance" : "5",
"enable" : "all"
}
}