ES not moving shards off a node with high disk usage (88%)

I'm seeing some odd behaviour.
My cluster was running out of disk space. I added more nodes, and room was freed up on all but two of the original nodes.

Those 2 nodes are still sitting above the high watermark.
I added another 3 nodes, and it is again moving data off the other nodes, but not off the 2 that are the most full.

It eventually moved data off 1 of those two nodes, but I still have one node sitting at 88% disk usage.

I can move shards off manually, or exclude the node and force it to start moving, but it seems like the cluster allocation should be doing this itself.
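For reference, by "manually" and "exclude" I mean roughly the following (the index, shard and node names here are just placeholders):

  POST /_cluster/reroute
  {
    "commands": [
      {
        "move": {
          "index": "some-index",
          "shard": 0,
          "from_node": "full-node",
          "to_node": "new-node"
        }
      }
    ]
  }

  PUT /_cluster/settings
  {
    "transient": {
      "cluster.routing.allocation.exclude._name": "full-node"
    }
  }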

I think part of the problem is this:
"explanation": "there are too many copies of the shard allocated to nodes with attribute [zone], there are [2] total configured shard copies for this shard id and [3] total attribute values, expected the allocated shard count per attribute [2] to be less than or equal to the upper bound of the required number of shards per attribute [1]"

I don't see where that upper bound is set, though. From what I have read, zone awareness should be enforcing a minimum per zone, not a maximum?

Any ideas on what is going on?

Adding the decision block from the allocation explain output; this block is for one of the new nodes I added.

  "node_id": "nodeis",
  "node_name": "node_name",
  "transport_address": "ipaddress:9300",
  "node_attributes": {
    "xpack.installed": "true",
    "tier": "hot",
    "type": "general",
    "env": "dev",
    "zone": "us-gov-west-1b"
  },
  "node_decision": "worse_balance",
  "weight_ranking": 3,
  "deciders": [
    {
      "decider": "max_retry",
      "decision": "YES",
      "explanation": "shard has no previous failures"
    },
    {
      "decider": "replica_after_primary_active",
      "decision": "YES",
      "explanation": "shard is primary and can be allocated"
    },
    {
      "decider": "enable",
      "decision": "YES",
      "explanation": "all allocations are allowed"
    },
    {
      "decider": "node_version",
      "decision": "YES",
      "explanation": "can relocate primary shard from a node with version [6.3.1] to a node with equal-or-newer version [6.3.1]"
    },
    {
      "decider": "snapshot_in_progress",
      "decision": "YES",
      "explanation": "no snapshots are currently running"
    },
    {
      "decider": "restore_in_progress",
      "decision": "YES",
      "explanation": "ignored as shard is not being recovered from a snapshot"
    },
    {
      "decider": "filter",
      "decision": "YES",
      "explanation": "node passes include/exclude/require filters"
    },
    {
      "decider": "same_shard",
      "decision": "YES",
      "explanation": "the shard does not exist on the same node"
    },
    {
      "decider": "disk_threshold",
      "decision": "YES",
      "explanation": "enough disk for shard on node, free: [533.1gb], shard size: [16.8gb], free after allocating shard: [516.2gb]"
    },
    {
      "decider": "throttling",
      "decision": "YES",
      "explanation": "below shard recovery limit of outgoing: [0 < 2] incoming: [0 < 2]"
    },
    {
      "decider": "shards_limit",
      "decision": "YES",
      "explanation": "total shard limits are disabled: [index: -1, cluster: -1] <= 0"
    },
    {
      "decider": "awareness",
      "decision": "YES",
      "explanation": "node meets all awareness attribute requirements"
    }
  ]
},

No, allocation awareness is a maximum, i.e. an upper bound on the number of shards in each zone.
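Working through the numbers in the explanation you quoted, that bound looks like a simple ceiling division of copies over zones:

  total configured copies for the shard id = 2  (1 primary + 1 replica)
  distinct values of the [zone] attribute  = 3
  upper bound per zone                     = ceil(2 / 3) = 1

So with 2 copies and 3 zones, awareness wants at most 1 copy in each zone.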

This tells us that although this node could hold the shard in question, it would make the balance of the cluster worse. Unfortunately it's hard to say why without looking at the allocation of all the shards.

So if I want to ensure that shards are split across zones, but allow more than one per zone, how would I do it? I.e., I have 3 zones configured, so I want it not to put all the shards in 1 zone, but an index with a shard count of 9 has more shards than zones.

Link to the entire GET /_cluster/allocation/explain?include_yes_decisions=true output:

https://drive.google.com/file/d/1kItSurzs5pEzHZcbNQ3OlCFrnp-y9vOB/view?usp=sharing

This confirms what I said above about there being no way of moving this shard to improve the cluster balance, but you have to look at the allocation of all the shards in the cluster to get a picture of the balance. Can you share the output of GET _cat/shards?

That's ok, allocation awareness just tries to divide all the shards evenly across zones.

However I note that you have some allocation filters in place. You must be careful when mixing allocation filters with awareness, because awareness doesn't take allocation filters into account so it may be applying an impossible constraint. Can you remove the allocation filters?
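For example (the exact setting name here is an assumption based on the env attribute visible in your allocation explain output; adjust it to whatever filters you actually use), an index-level filter like index.routing.allocation.require.env is the sort of thing that can conflict with awareness, and it can be cleared by setting it back to null:

  PUT /your-index/_settings
  {
    "index.routing.allocation.require.env": null
  }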

I can remove the allocation awareness, but I cannot remove the allocation filters. We use the allocation filters to segregate data environments.

Is there a way to just ensure that it doesn't put all shards in the same zone? That is all I care about. Alternatively, I just want to make sure each shard has a replica in another zone that can be promoted if a zone goes down (zones = Amazon AZs, btw).

Here is the cat shards output:
https://drive.google.com/file/d/1JMx5q1clHvS74F6hIk3Yqktz7qj1UBqJ/view?usp=sharing

(Yes, I know the shard counts are not right for the size of the indices; I haven't had time to fix that yet.)

Thanks, this cluster looks rather unbalanced because of those three environments:

$ cat cat_shards.txt  | awk '{print $8}' | sort | uniq -c
 258 ENV_One-general1
 237 ENV_One-general10
 234 ENV_One-general11
 234 ENV_One-general12
 264 ENV_One-general2
 263 ENV_One-general3
 265 ENV_One-general4
 264 ENV_One-general5
 255 ENV_One-general6
 262 ENV_One-general7
 259 ENV_One-general8
 255 ENV_One-general9
 558 ENV_Three-general1
 558 ENV_Three-general2
 558 ENV_Three-general3
 680 ENV_Two-general1
 680 ENV_Two-general2
 680 ENV_Two-general3

The nodes in ENV_Two and ENV_Three (presumably matching prod and stage from the allocation explanations) have over twice as many shards as in ENV_One (i.e. dev). The balancer's goal is for these counts to be even across the cluster. It's more usual to use completely separate clusters in different environments.

Also, re-reading your original post, it seems that the node with the least disk space is only at 88%. Elasticsearch only starts to move shards off a node once it has exceeded the high watermark, which defaults to 90%, so I think this is what we expect. It also only allocates shards to nodes below the low watermark, which defaults to 85%, so this node will not receive any new shards. There are more details in the docs.
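Those thresholds are the defaults, and they can be adjusted dynamically if you want shards to start moving off earlier (the values here are just for illustration):

  PUT /_cluster/settings
  {
    "transient": {
      "cluster.routing.allocation.disk.watermark.low": "80%",
      "cluster.routing.allocation.disk.watermark.high": "85%"
    }
  }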

Yes, allocation awareness is the right way to do this.
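For reference, enabling it is just a matter of telling the cluster which node attribute to use (your nodes already have a zone attribute, judging by the allocation explain output):

  PUT /_cluster/settings
  {
    "persistent": {
      "cluster.routing.allocation.awareness.attributes": "zone"
    }
  }

The forced variant (cluster.routing.allocation.awareness.force.zone.values) is stricter: it leaves replicas unassigned rather than over-filling the surviving zones if a whole zone goes down, so you probably don't want that here.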

I just reviewed my visualization for this, and it has now moved the data off that node.
(Note: several times while this was happening it did reach 90%, and there was room on other nodes to move the shards to.)

I have no idea why it suddenly moved the shards off this morning.

I have a (slightly speculative) theory. When a node hits 90% full, Elasticsearch picks some shards to move to bring the node back below 90%. It doesn't try to move the largest shards or anything like that; it just picks some arbitrary shards and starts relocating them. Your shards have a very wide variety of sizes, so sometimes it'll move a little shard and other times it'll move a big one. It looks like the node was bumping along just below 90%, and every time it hit 90% it moved a small shard off elsewhere to bring it back below 90%. Then this morning it happened to choose some of the huge shards to move instead, causing a much larger drop in disk usage.
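If you want to keep an eye on this in future, something like the following (standard _cat/shards column and sort parameters) lists shards by size along with the node they are on, which makes it easier to see which shards it chose to relocate:

  GET /_cat/shards?v&h=index,shard,prirep,store,node&s=store:desc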
