Shard relocation storms when cluster disk is low

We have a 27-data-node ES 5.5 cluster with 1.7TB of disk per node. The ES team has created an amazingly easy-to-use, high-performance system. Our cluster normally works beautifully. It ingests over a billion logs per day.

We're trying to run the cluster with less free space than we've maintained in the past (all nodes currently have between 13% and 15% free) and have run into a problem.

Occasionally one node becomes imbalanced with a few large, active shards, and its free space drops to under 5%. That triggers disk high-watermark shard relocations. ES then schedules too many moves off that node, pushing other nodes over their high watermark and triggering further relocations. The situation can cascade out of control like a runaway reactor.

Our shard sizes vary: some indices have shards under 0.1gb, while others are around 40gb. 2672 active shards use 40TB of disk. To get reasonable allocation for such diverse shard sizes, we set cluster.routing.allocation.balance.shard to 0 so that each index's shards are spread evenly across nodes. (This also greatly reduced hot spots in the cluster.)
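
For reference, weights like these can be applied through the cluster settings API; a minimal sketch (the host and the persistent/transient choice are placeholders):

    # Sketch: apply the balance weights dynamically; values match our cluster settings
    curl -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
    {
      "persistent": {
        "cluster.routing.allocation.balance.shard": 0.0,
        "cluster.routing.allocation.balance.index": 1.0
      }
    }'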

We'd like ES to perform the minimal set of relocations necessary to bring a node's free space back above the high watermark.

We'd also like a variant of BalancedShardsAllocator that prefers nodes with free disk space, so that shards are preferentially allocated to the nodes with more disk.
When a node runs low on disk, it would be great to have a small number of shards move from it to the nodes with the most free disk.

Has anyone else seen this behavior? Are there any cluster configurations that would help eliminate these relocation storms?

Does anyone know of work being done to make balancing more disk-aware? Are there ES versions that behave better than 5.5?

Thanks in advance!

  • Ken

I'm pinging @dakrone as, IIRC, he has worked a lot on this topic.
I know there have been a lot of improvements in the 6.x series, but I'm unsure about your specific questions.

Our cluster normally works beautifully. It ingests over a billion logs per day.

This is great! :blush:

Hi Ken,

There haven't been many improvements to this area for single-path data configurations since 5.5. I have a couple of questions for you though:

  • How many concurrent relocations are you using?
  • You said you're running at 13-15% free disk; what are your high and low watermarks configured at? (You said high was at 5%, but what is the low watermark?)

We haven't changed cluster.routing.allocation.cluster_concurrent_rebalance from the default of 2. Here are the other settings:

    "cluster" : {
      "routing" : {
        "allocation" : {
          "balance" : {
            "index" : "1.0",
            "shard" : "0.0"
          },
          "node_concurrent_recoveries" : "1",
          "disk" : {
            "watermark" : {
              "low" : "120gb",
              "high" : "80gb"
            }
          }
        }
      }
    }

We have set node_concurrent_recoveries higher in the past (it was helpful when recovering from dead data nodes), but I have it set to 1 now because it seems to reduce the size of a rebalancing storm and hopefully gives us time to intervene.

The cluster runs in a steady state, with the volume of new ingest matching the old indices that are expiring. Sometimes we get unlucky and several of our large indices are placed on a single node, pushing it over the high watermark. It has happened twice "in the wild" and I repeated it twice more by adjusting the watermarks.

When a node hits the high watermark, the first set of shard relocations frees up hundreds of gigabytes of space. I haven't figured out why so many shards move at once.
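
Presumably the cluster allocation explain API could shed light on an individual move; a minimal sketch, where the host, index name, and shard number are placeholders:

    # Sketch: ask why a given shard can (or cannot) remain on its current node
    curl -XGET 'localhost:9200/_cluster/allocation/explain?pretty' -H 'Content-Type: application/json' -d '
    {
      "index": "my-daily-index",
      "shard": 0,
      "primary": true
    }'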

Thanks for looking at this. I really appreciate it.

  • Ken

Thanks for the info. Keeping concurrent recoveries low (like you have it) is good, so you don't relocate too many things at once.

Sometimes we get unlucky and several of our large indices are placed on a single node pushing it into high water.

This is the part that it would be good to get Elasticsearch to avoid. When checking whether a shard can be allocated, Elasticsearch does check that allocating the shard to the node doesn't put it over the high watermark: https://github.com/elastic/elasticsearch/blob/5971eb83c434c1e52edbda2c038bf28740bab6d2/server/src/main/java/org/elasticsearch/cluster/routing/allocation/decider/DiskThresholdDecider.java#L217

In your case, when you say that large indices are placed on a single node, is this when the indices are newly created and then grow to a size that passes the high watermark?

Could you also attach the output of /_cat/allocation?v and /_cat/shards?v, please (assuming it's not too sensitive)?

The system is just about to expire old indices and allocate new daily indices, so this is the low point of a steady-state oscillation. Four nodes are down to around 10% free and the system as a whole is at 12% free. That's a bit lower than usual, but it will probably (hopefully!) bounce back.

Shard allocation always honors the low watermark, and with the index weighting shards are always spread across the maximum number of nodes. This usually works well.

We have several daily indices that are larger than the others. If one node is unlucky enough to get shards from all of the large indices and its expiring shards don't free up enough space, it can hit the high watermark by the end of the day. In that case we'll usually see one node at 5% free and another at 20%. (I say usually, but it has only happened in the wild twice!)

When the high watermark is hit on a single node, I don't see just one recovery happening; it's more like a shard per node. When I modified the high watermark to induce this problem, 22 shards started relocating. I didn't verify all of them, but most were replicas.
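
Something like this shows what is moving at any given moment (the host is a placeholder):

    # Sketch: list only the relocating shards, plus the active recoveries
    curl -s 'localhost:9200/_cat/shards?v' | grep RELOCATING
    curl -s 'localhost:9200/_cat/recovery?v&active_only=true'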

$ curl -s /_cat/allocation?v
shards disk.indices disk.used disk.avail disk.total disk.percent host       ip         node
    87        1.4tb     1.5tb    183.2gb      1.7tb           89 10.2.2.31  10.2.2.31  data-06ffe365e540796c2
    81        1.3tb     1.4tb    226.1gb      1.7tb           87 10.2.1.75  10.2.1.75  data-0c04b74f4689ca12a
    93        1.3tb     1.4tb    219.7gb      1.7tb           87 10.2.0.185 10.2.0.185 data-0ef6acb62154a2d40
   102        1.4tb     1.4tb    216.3gb      1.7tb           87 10.2.0.97  10.2.0.97  data-05f7213f0bff15b19
   100        1.3tb     1.4tb    221.4gb      1.7tb           87 10.2.0.90  10.2.0.90  data-0544cab5837b38f22
    94        1.4tb     1.4tb    208.8gb      1.7tb           88 10.2.0.33  10.2.0.33  data-0e4765d530d0ba1ad
    92        1.4tb     1.4tb    212.3gb      1.7tb           87 10.2.0.136 10.2.0.136 data-099a05c3a185e0f97
   133        1.4tb     1.4tb    212.8gb      1.7tb           87 10.2.1.130 10.2.1.130 data-08f3063c6c957eba6
    83        1.4tb     1.4tb    214.7gb      1.7tb           87 10.2.2.104 10.2.2.104 data-007f12ecdf016502c
   107        1.4tb     1.4tb      214gb      1.7tb           87 10.2.2.141 10.2.2.141 data-03be4de1275db4b28
    81        1.3tb     1.4tb    234.4gb      1.7tb           86 10.2.2.106 10.2.2.106 data-09440c5547821a445
    91        1.4tb     1.5tb    184.9gb      1.7tb           89 10.2.2.249 10.2.2.249 data-0b1be52a100bbaf26
    92        1.3tb     1.4tb      234gb      1.7tb           86 10.2.0.114 10.2.0.114 data-0e1a1af8a278bc1d1
    90        1.4tb     1.5tb    171.1gb      1.7tb           90 10.2.1.180 10.2.1.180 data-07ec6435e49996c0d
   103        1.4tb     1.4tb    214.8gb      1.7tb           87 10.2.1.28  10.2.1.28  data-062b7a5a97f43a30e
   157        1.4tb     1.4tb    214.4gb      1.7tb           87 10.2.2.43  10.2.2.43  data-0e4d148ae48ac24c9
   105        1.4tb     1.4tb    219.2gb      1.7tb           87 10.2.1.217 10.2.1.217 data-04ed1e79257d2cd0b
    90        1.3tb     1.4tb    224.4gb      1.7tb           87 10.2.1.112 10.2.1.112 data-09f089b63e5ee8354
    92        1.4tb     1.4tb      210gb      1.7tb           87 10.2.1.157 10.2.1.157 data-0e943dcb33500eb6a
   107        1.3tb     1.4tb    235.4gb      1.7tb           86 10.2.0.207 10.2.0.207 data-03281cda6117dfb5c
    90        1.3tb     1.4tb    223.7gb      1.7tb           87 10.2.1.136 10.2.1.136 data-093a3f3f5f6c9733b
   110        1.3tb     1.4tb    235.9gb      1.7tb           86 10.2.0.126 10.2.0.126 data-0d1cacdb4631db578
   111        1.3tb     1.4tb      232gb      1.7tb           86 10.2.2.165 10.2.2.165 data-0058f931c7c204426
    81        1.4tb     1.5tb    177.6gb      1.7tb           89 10.2.1.187 10.2.1.187 data-0f854eda14ccb32f4
    93        1.4tb     1.4tb    213.3gb      1.7tb           87 10.2.2.220 10.2.2.220 data-02d37f39696295bab
   100        1.3tb     1.4tb    223.6gb      1.7tb           87 10.2.2.245 10.2.2.245 data-056a37229d96ead4e
   131        1.4tb     1.4tb    206.9gb      1.7tb           88 10.2.0.60  10.2.0.60  data-062b3544138251188
  • Ken

I'm going to try to reproduce the problem on a smaller cluster. I don't think it's a function of data volume. Hopefully that will give you more information on whether this is a bug or just a system limitation.

Thanks for looking at this.

  • Ken

I am going to preallocate daily indices ahead of time and move any poorly placed shards. That won't work for everyone, but our shard sizes are very predictable, so automation can decide in advance whether a placement is poor. Moving an empty shard is very fast. :slight_smile:
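
A rough sketch of the kind of thing the automation could do (the index name, shard number, and node names are placeholders): create the next day's index ahead of time, then use the reroute API to move any empty shard that landed badly.

    # Sketch: preallocate tomorrow's index, then relocate a poorly placed (still empty) shard
    curl -XPUT 'localhost:9200/logs-placeholder-date'
    curl -XPOST 'localhost:9200/_cluster/reroute' -H 'Content-Type: application/json' -d '
    {
      "commands": [
        {
          "move": {
            "index": "logs-placeholder-date",
            "shard": 3,
            "from_node": "data-node-with-least-free-disk",
            "to_node": "data-node-with-most-free-disk"
          }
        }
      ]
    }'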

This "just" leaves us with optimal disaster recovery shard placement. Hopefully when that happens we'll be on the version of Elasticsearch that has best-fit shard allocation taking free disk into account!

Ken

I've linked this thread as a +1 for #17213 since I think that idea might help here.

In the meantime, I have a hunch that if you expire the old (large) indices before creating the new (large) ones, with rebalancing temporarily disabled, then the new ones will be allocated in the spaces that the old ones have left, based entirely on shard counts. Could you try that?
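
Roughly, one way to run that experiment (the host and index names are placeholders; rebalancing is switched back on at the end):

    # Sketch: pause rebalancing, swap the indices, then re-enable rebalancing
    curl -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
    {"transient": {"cluster.routing.rebalance.enable": "none"}}'

    curl -XDELETE 'localhost:9200/old-large-index'   # expire the old large index
    curl -XPUT 'localhost:9200/new-large-index'      # create the new large index

    curl -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
    {"transient": {"cluster.routing.rebalance.enable": "all"}}'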

There's also the index.routing.allocation.total_shards_per_node setting, which can limit the number of shards of each index allocated to each node. It affects all indices, and it is a hard limit, so it might sometimes leave you with a yellow cluster, but it might help.
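
A minimal sketch of how that could be applied to the daily indices via a 5.x index template (the template name, pattern, and limit are placeholders to tune for your shard counts):

    # Sketch: cap the shards-per-node for indices matching a pattern
    curl -XPUT 'localhost:9200/_template/daily-logs-cap' -H 'Content-Type: application/json' -d '
    {
      "template": "logs-*",
      "settings": {
        "index.routing.allocation.total_shards_per_node": 2
      }
    }'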

We switched to cluster.routing.allocation.balance.index: 1.0 a couple of months ago to solve node hot-spot issues. The indices that are currently expiring don't satisfy the flat index-balance requirement, but as soon as they are gone I will try deleting old indices immediately before creating the new ones.

When we upgrade to 6.x, it may be best to split the cluster up. The mixed workload with varying index sizes seems to cause more tuning trouble than running multiple clusters, each with a consistent workload.

Ken

D'oh! I was mistaken in thinking that the old indices were not rebalanced when we made the weighting change to cluster.routing.allocation.balance. Everything is flat across all nodes, so I can do that delete experiment today.

Ken
