Hi,
we're currently managing an ES cluster with a hot/cold architecture: 25 cold nodes with 12 TB of disk each, and 42 hot nodes with 2 TB each.
For quite a while, a few of our cold nodes have been running very low on space, below the low disk watermark, while others have plenty of free space (99% disk usage versus 81%). We assume this is because the allocator tries to balance the number of shards on each node, and our cluster's data is quite heterogeneous: we have a lot of ~50 GB shards but also quite a few shards of 1 GB or even a few MB.
According to our stats, most hot nodes hold around 120 shards, while the cold nodes are in the range of 659. The nodes that are low on space mostly hold around 5XX shards, so we assume the balancing algorithm keeps sending shards to nodes that are already almost full, because those nodes contain larger-than-average shards and therefore fewer of them.
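For reference, this is roughly how we collect those per-node numbers (shard count and disk usage), via the _cat allocation API; the column list and sort are just what we happen to find useful:

GET _cat/allocation?v&h=node,shards,disk.indices,disk.used,disk.avail,disk.percent&s=disk.percent:desc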
We're wondering if there's a way to balance our cluster better by taking disk space into account, or any other policy we could implement; we assume ES 8 addresses this with cluster.routing.allocation.balance.disk_usage.
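If we understand it correctly, the idea would be something along these lines (just a sketch, assuming an ES 8 version that ships the desired-balance allocator and this setting; the value is purely illustrative, we don't know what a sensible weight would be):

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.balance.disk_usage": "2e-11"
  }
}

As we understand it, a higher weight should make the balancer factor each shard's (forecasted) disk usage into the desired balance, rather than just shard counts, but please correct us if that's not how it works.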
This is our cluster config (the parts we think are relevant to the case):
"node_concurrent_recoveries" : "32",
"disk" : {
"watermark" : {
"low" : "100gb",
"flood_stage" : "10gb",
"high" : "50gb"
}
"transient" : {
"cluster" : {
"routing" : {
"rebalance" : {
"enable" : "all"
},
"allocation" : {
"balance" : {
"index" : "5.0f",
"threshold" : "5.0f"
},
"cluster_concurrent_rebalance" : "5",
"enable" : "all"
}
}