Shrink operation in ILM is unusable when same_shard.host is set in a multi-node-per-host cluster

Hi, I am trying to confirm whether this is really the case: the shrink operation, via the API or ILM, is unusable, or at least unreliable, when both of these conditions apply:

  • cluster.routing.allocation.same_shard.host is set to true. This setting disallows allocating more than one copy of a shard to the same host (see the snippet after this list).
  • Multiple data nodes run on the same host. We use one node per physical attached disk.
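
For reference, a minimal sketch of applying this setting via the cluster settings API (assuming it is set dynamically; it can equally live in elasticsearch.yml):

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.same_shard.host": true
  }
}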

Shrinking an index requires that a copy of every shard of that index be allocated to a single node; let's call it the target node. In a multi-node-per-host setup, a shard copy on another node on the same host as the target node can block that move because of the same_shard setting, and ILM gets stuck.
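
For context, a warm-phase shrink policy such as the shrink10m one used below might look roughly like this; the min_age and the target shard count are assumptions, not copied from the actual policy:

PUT _ilm/policy/shrink10m
{
  "policy": {
    "phases": {
      "warm": {
        "min_age": "10m",
        "actions": {
          "shrink": {
            "number_of_shards": 1
          }
        }
      }
    }
  }
}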

Example:

3 data hosts, each running 4 nodes, for a 12-node cluster in total.

The ilmtest1 index has 6 shards and 1 replica.
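
The index was created along these lines (a sketch reconstructed from the description above; the exact settings body is an assumption):

PUT ilmtest1
{
  "settings": {
    "index.number_of_shards": 6,
    "index.number_of_replicas": 1,
    "index.lifecycle.name": "shrink10m"
  }
}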

The shrink10m ILM policy tries to shrink the ilmtest1 index. ILM runs its pre-checks, locks the index, and tries to move the shards. The last log message from ILM:

moving index [ilmtest1] from [{"phase":"warm","action":"shrink","name":"set-single-node-allocation"}] to [{"phase":"warm","action":"shrink","name":"check-shrink-allocation"}] in policy [shrink10m]

ILM status for the index:

GET ilmtest1/_ilm/explain?human
{
  "indices" : {
    "ilmtest1" : {
      ...
      "action" : "shrink",
      "step" : "check-shrink-allocation",
      "shrink_index_name" : "shrink-d8ut-ilmtest1",
      "step_info" : {
        "message" : "Waiting for node [yyt9lJfxTD66O13E6UZPxg] to contain [6] shards, found [3], remaining [3]",
        "node_id" : "yyt9lJfxTD66O13E6UZPxg",
        "shards_left_to_allocate" : 3,
        "expected_shards" : 6
      }
    }
  }
}

Which node is the target for the shrink job:

GET ilmtest1/_settings?filter_path=ilmtest1.settings.index.routing.allocation.require
{
  "ilmtest1" : {
    "settings" : {
      "index" : {
        "routing" : {
          "allocation" : {
            "require" : {
              "_id" : "yyt9lJfxTD66O13E6UZPxg"
            }
          }
        }
      }
    }
  }
}

Node list (showing the target node):

10.135.133.95    0 dr        -      7.17.1  m1-john-data2 yyt9lJfxTD66O13E6UZPxg

Shard list:

index    shard prirep state    docs store ip             node
ilmtest1 0     p      STARTED 50377 4.2mb 10.135.133.95  m1-john-data2  <- 0p on target node
ilmtest1 0     r      STARTED 50377 4.2mb 10.135.156.149 m3-john-data1
ilmtest1 1     p      STARTED 50380 4.2mb 10.135.104.36  m1-john-data3
ilmtest1 1     r      STARTED 50380 4.2mb 10.135.133.95  m1-john-data2  <- 1r on target node
ilmtest1 2     p      STARTED 49707 4.2mb 10.135.133.95  m1-john-data2  <- 2p on target node
ilmtest1 2     r      STARTED 49707 4.1mb 10.135.104.36  m3-john-data3
ilmtest1 3     p      STARTED 49659 4.2mb 10.135.104.36  m2-john-data3  <- 3p should move to the target node!
ilmtest1 3     r      STARTED 49659 4.2mb 10.135.133.95  m2-john-data2  <- 3r blocks the 3p move via same_shard
ilmtest1 4     p      STARTED 49857 4.2mb 10.135.156.149 m1-john-data1
ilmtest1 4     r      STARTED 49857 4.3mb 10.135.133.95  m3-john-data2
ilmtest1 5     p      STARTED 50020 4.2mb 10.135.133.95  m4-john-data2
ilmtest1 5     r      STARTED 50020 4.2mb 10.135.156.149 m2-john-data1

Check 3p allocation status:

GET /_cluster/allocation/explain?pretty
{
  "index": "ilmtest1",
  "shard": 3,
  "primary": true,
  "current_node": "m2-john-data3"
}

{
  "index" : "ilmtest1",
  "shard" : 3,
  "primary" : true,
  "can_remain_on_current_node" : "no",
  "can_move_to_other_node" : "no",
  "move_explanation" : "cannot move shard to another node, even though it is not allowed to remain on its current node",
  ...
  "node_allocation_decisions" : [
  ...
    {
      "node_id" : "yyt9lJfxTD66O13E6UZPxg",
      "node_name" : "m1-john-data2",
      "transport_address" : "10.135.133.95:9301",
      "node_attributes" : {
        "zone" : "do",
        "xpack.installed" : "true",
        "transform.node" : "false"
      },
      "node_decision" : "no",
      "weight_ranking" : 11,
      "deciders" : [
        {
          "decider" : "same_shard",
          "decision" : "NO",
          "explanation" : "a copy of this shard is already allocated to host address [10.135.133.95], on node [yyt9lJfxTD66O13E6UZPxg], and [cluster.routing.allocation.same_shard.host] is [true] which forbids more than one node on this host from holding a copy of this shard"
        }
      ]
    }
  ]
}

And now ILM for the ilmtest1 index is stuck. Is there any way to avoid this without manually moving replica shards away from the target node?
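
For completeness, a sketch of the manual workaround I am asking to avoid: move the blocking replica (3r on m2-john-data2) to a node on a different host, after which 3p is free to relocate to the target node. The destination node below is just one choice that same_shard would allow:

POST _cluster/reroute
{
  "commands": [
    {
      "move": {
        "index": "ilmtest1",
        "shard": 3,
        "from_node": "m2-john-data2",
        "to_node": "m1-john-data1"
      }
    }
  ]
}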

One correction: it seems that the shrink API requires you to manually move the shards to the target node before calling shrink.

So this problem applies only when using ILM. I will update the topic if I can.
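
For reference, the manual flow looks roughly like this: pin all shard copies to the target node (ILM's set-single-node-allocation step does the same thing via _id, as shown in the settings output above) and add a write block, wait for relocation to finish, then call the shrink API. The target index name and shard count here are arbitrary:

PUT ilmtest1/_settings
{
  "index.routing.allocation.require._name": "m1-john-data2",
  "index.blocks.write": true
}

POST ilmtest1/_shrink/ilmtest1-shrunk
{
  "settings": {
    "index.number_of_shards": 1,
    "index.routing.allocation.require._name": null
  }
}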

Interesting, this is definitely a bug, but it seems it has existed for years without anyone noticing. I opened #104793. As mentioned in the issue, allocation awareness handles this correctly, so I'd suggest using that instead of the same-host decider.
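
For example, a rough sketch of that approach: tag every node with a custom attribute identifying its physical host (the attribute name host_id and its values are arbitrary), then enable awareness on that attribute:

node.attr.host_id: host-1        <- in elasticsearch.yml on every node of the same physical host

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.awareness.attributes": "host_id"
  }
}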

Thank you. I will look into using shard allocation awareness.
