Shrink of shard fails with error "source primary is allocated on another node"

We are running an Elastic Cloud cluster with 3 nodes and recently changed the lifecycle policy of one of our data streams:
from:

{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "10GB",
            "max_age": "7d"
          }
        }
      },
      "warm": {
        "actions": {
          "shrink" : {
            "number_of_shards": 1
          }
        }
      },
      "delete": {
        "min_age": "31d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}

to:

{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_primary_shard_size": "50gb",
            "max_age": "7d"
          }
        }
      },
      "warm": {
        "min_age": "0ms",
        "actions": {
          "shrink": {
            "max_primary_shard_size": "50gb"
          }
        }
      },
      "delete": {
        "min_age": "31d",
        "actions": {
          "delete": {
            "delete_searchable_snapshot": true
          }
        }
      }
    }
  }
}

The shrink action that ran after the rollover failed. When calling the _cluster/allocation/explain endpoint, we see that two of the nodes fail because they don't hold all the shards, and the third one fails with the error:
"source primary is allocated on another node". Let's call this node "node-0".

When examining the index that should have been shrunk, I see that it has a new routing rule requiring it to be allocated to node-0. However, one of the shards allocated on node-0 is only a replica, not a primary.

A shrink index has been created, but one of its shards (both primary and replica) is unassigned. It is the same shard for which node-0 only holds a replica of the original index. Furthermore, the shrink index contains only 2/3 of the original data.
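
To check why, the allocation explain API can be pointed at that specific unassigned shard of the shrink index, along these lines (the index name and shard number here are placeholders, not our real values):

GET _cluster/allocation/explain
{
  "index": "shrink-<target-index>",
  "shard": 0,
  "primary": true
}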

Any explanation on how this could happen?

By the way, our Elasticsearch version is 7.17.6.

Welcome to our community! :smiley:

Can you share the full output of that please?

{
  "index" : "[FAILED_SHRINK_INDEX_NAME]",
  "node_allocation_decisions" : [
    {
      "node_name" : "instance-0000000000",
      "deciders" : [
        {
          "decider" : "resize",
          "decision" : "NO",
          "explanation" : "source primary is allocated on another node"
        },
        {
          "decider" : "filter",
          "decision" : "NO",
          "explanation" : """initial allocation of the shrunken index is only allowed on nodes [_id:"[INSTANCE1]"] that hold a copy of every shard in the index"""
        }
      ]
    },
    {
      "node_name" : "instance-0000000003",
      "deciders" : [
        {
          "decider" : "filter",
          "decision" : "NO",
          "explanation" : """initial allocation of the shrunken index is only allowed on nodes [_id:"[INSTANCE1]"] that hold a copy of every shard in the index"""
        }
      ]
    },
    {
      "node_name" : "instance-0000000001",
      "deciders" : [
        {
          "decider" : "resize",
          "decision" : "NO",
          "explanation" : "source primary is allocated on another node"
        }
      ]
    }
  ]
}

We also hit this exact same issue, about 1 week after turning on ILM for the first time.

Captured the allocation explain output and the current shard allocation of the source index from the shrink operation:

/_cluster/allocation/explain output
{
    "note": "No shard was specified in the explain API request, so this response explains a randomly chosen unassigned shard. There may be other unassigned shards in this cluster which cannot be assigned for different reasons. It may not be possible to assign this shard until one of the other shards is assigned correctly. To explain the allocation of other shards (whether assigned or unassigned) you must specify the target shard in the request to this API.",
    "index": "shrink-c0tp-v60.agentprocessevent@1m-001820",
    "shard": 7,
    "primary": true,
    "current_state": "unassigned",
    "unassigned_info": {
        "reason": "INDEX_CREATED",
        "at": "2022-12-18T22:53:44.659Z",
        "last_allocation_status": "no"
    },
    "can_allocate": "no",
    "allocate_explanation": "cannot allocate because allocation is not permitted to any of the nodes",
    "node_allocation_decisions": [
        {
            "node_id": "2BvqNyIjT3W0KfWvC9sOkg",
            "node_name": "elasticsearch-0-es-warm-19",
            "transport_address": "10.64.130.4:9300",
            "node_attributes": {
                "k8s_node_name": "gke-eu-cluster-0-e2-gen4-22399c39-dqwm",
                "warm": "true",
                "xpack.installed": "true",
                "zone": "europe-west1-d",
                "transform.node": "false"
            },
            "node_decision": "no",
            "weight_ranking": 15,
            "deciders": [
                {
                    "decider": "resize",
                    "decision": "NO",
                    "explanation": "source primary is allocated on another node"
                },
                {
                    "decider": "filter",
                    "decision": "NO",
                    "explanation": "initial allocation of the shrunken index is only allowed on nodes [_id:\"KpGc5O6eTpKT5Tr4tTW0uQ\"] that hold a copy of every shard in the index"
                }
            ]
        },
        {
            "node_id": "fpIRkV8qTDO29ys5r_wx_g",
            "node_name": "elasticsearch-0-es-warm-17",
            "transport_address": "10.64.182.6:9300",
            "node_attributes": {
                "k8s_node_name": "gke-eu-cluster-0-e2-gen4-22399c39-hjaj",
                "warm": "true",
                "xpack.installed": "true",
                "zone": "europe-west1-d",
                "transform.node": "false"
            },
            "node_decision": "no",
            "weight_ranking": 16,
            "deciders": [
                {
                    "decider": "resize",
                    "decision": "NO",
                    "explanation": "source primary is allocated on another node"
                },
                {
                    "decider": "filter",
                    "decision": "NO",
                    "explanation": "initial allocation of the shrunken index is only allowed on nodes [_id:\"KpGc5O6eTpKT5Tr4tTW0uQ\"] that hold a copy of every shard in the index"
                }
            ]
        },
        {
            "node_id": "r8JQSVObQ42Bu60BEeGVMg",
            "node_name": "elasticsearch-0-es-warm-5",
            "transport_address": "10.64.169.6:9300",
            "node_attributes": {
                "k8s_node_name": "gke-eu-cluster-0-e2-gen4-22399c39-vm1a",
                "warm": "true",
                "xpack.installed": "true",
                "zone": "europe-west1-d",
                "transform.node": "false"
            },
            "node_decision": "no",
            "weight_ranking": 17,
            "deciders": [
                {
                    "decider": "resize",
                    "decision": "NO",
                    "explanation": "source primary is allocated on another node"
                },
                {
                    "decider": "filter",
                    "decision": "NO",
                    "explanation": "initial allocation of the shrunken index is only allowed on nodes [_id:\"KpGc5O6eTpKT5Tr4tTW0uQ\"] that hold a copy of every shard in the index"
                }
            ]
        },
        {
            "node_id": "bYEGLS8qS1OBJ62l3KuKmw",
            "node_name": "elasticsearch-0-es-warm-8",
            "transport_address": "10.64.171.5:9300",
            "node_attributes": {
                "k8s_node_name": "gke-eu-cluster-0-e2-gen4-22399c39-6bhd",
                "warm": "true",
                "xpack.installed": "true",
                "zone": "europe-west1-d",
                "transform.node": "false"
            },
            "node_decision": "no",
            "weight_ranking": 18,
            "deciders": [
                {
                    "decider": "resize",
                    "decision": "NO",
                    "explanation": "source primary is allocated on another node"
                },
                {
                    "decider": "filter",
                    "decision": "NO",
                    "explanation": "initial allocation of the shrunken index is only allowed on nodes [_id:\"KpGc5O6eTpKT5Tr4tTW0uQ\"] that hold a copy of every shard in the index"
                }
            ]
        },
        {
            "node_id": "iEPAh6X4T5SZD3zKcTE4TQ",
            "node_name": "elasticsearch-0-es-warm-3",
            "transport_address": "10.64.153.3:9300",
            "node_attributes": {
                "k8s_node_name": "gke-eu-cluster-0-e2-gen4-bdb36591-2rhi",
                "warm": "true",
                "xpack.installed": "true",
                "zone": "europe-west1-b",
                "transform.node": "false"
            },
            "node_decision": "no",
            "weight_ranking": 19,
            "deciders": [
                {
                    "decider": "resize",
                    "decision": "NO",
                    "explanation": "source primary is allocated on another node"
                },
                {
                    "decider": "filter",
                    "decision": "NO",
                    "explanation": "initial allocation of the shrunken index is only allowed on nodes [_id:\"KpGc5O6eTpKT5Tr4tTW0uQ\"] that hold a copy of every shard in the index"
                }
            ]
        },
        {
            "node_id": "RuGVVrmBQomXmzJrfb-Hvw",
            "node_name": "elasticsearch-0-es-warm-2",
            "transport_address": "10.64.174.5:9300",
            "node_attributes": {
                "k8s_node_name": "gke-eu-cluster-0-e2-gen4-22399c39-sybb",
                "warm": "true",
                "xpack.installed": "true",
                "zone": "europe-west1-d",
                "transform.node": "false"
            },
            "node_decision": "no",
            "weight_ranking": 20,
            "deciders": [
                {
                    "decider": "resize",
                    "decision": "NO",
                    "explanation": "source primary is allocated on another node"
                },
                {
                    "decider": "filter",
                    "decision": "NO",
                    "explanation": "initial allocation of the shrunken index is only allowed on nodes [_id:\"KpGc5O6eTpKT5Tr4tTW0uQ\"] that hold a copy of every shard in the index"
                }
            ]
        },
        {
            "node_id": "gbQY19IgTo28ABTg8QvyFw",
            "node_name": "elasticsearch-0-es-warm-9",
            "transport_address": "10.64.150.7:9300",
            "node_attributes": {
                "k8s_node_name": "gke-eu-cluster-0-e2-gen4-bdb36591-vgxz",
                "warm": "true",
                "xpack.installed": "true",
                "zone": "europe-west1-b",
                "transform.node": "false"
            },
            "node_decision": "no",
            "weight_ranking": 21,
            "deciders": [
                {
                    "decider": "resize",
                    "decision": "NO",
                    "explanation": "source primary is allocated on another node"
                },
                {
                    "decider": "filter",
                    "decision": "NO",
                    "explanation": "initial allocation of the shrunken index is only allowed on nodes [_id:\"KpGc5O6eTpKT5Tr4tTW0uQ\"] that hold a copy of every shard in the index"
                }
            ]
        },
        {
            "node_id": "J1hIgk9zSpCydd2BMcoFTA",
            "node_name": "elasticsearch-0-es-warm-1",
            "transport_address": "10.64.152.5:9300",
            "node_attributes": {
                "k8s_node_name": "gke-eu-cluster-0-e2-gen4-2898672c-13n8",
                "warm": "true",
                "xpack.installed": "true",
                "zone": "europe-west1-c",
                "transform.node": "false"
            },
            "node_decision": "no",
            "weight_ranking": 22,
            "deciders": [
                {
                    "decider": "resize",
                    "decision": "NO",
                    "explanation": "source primary is allocated on another node"
                },
                {
                    "decider": "filter",
                    "decision": "NO",
                    "explanation": "initial allocation of the shrunken index is only allowed on nodes [_id:\"KpGc5O6eTpKT5Tr4tTW0uQ\"] that hold a copy of every shard in the index"
                }
            ]
        },
        {
            "node_id": "rqS7V2UyTuWbsXOe9piKZw",
            "node_name": "elasticsearch-0-es-warm-13",
            "transport_address": "10.64.164.3:9300",
            "node_attributes": {
                "k8s_node_name": "gke-eu-cluster-0-e2-gen4-2898672c-uzkm",
                "warm": "true",
                "xpack.installed": "true",
                "zone": "europe-west1-c",
                "transform.node": "false"
            },
            "node_decision": "no",
            "weight_ranking": 23,
            "deciders": [
                {
                    "decider": "resize",
                    "decision": "NO",
                    "explanation": "source primary is allocated on another node"
                },
                {
                    "decider": "filter",
                    "decision": "NO",
                    "explanation": "initial allocation of the shrunken index is only allowed on nodes [_id:\"KpGc5O6eTpKT5Tr4tTW0uQ\"] that hold a copy of every shard in the index"
                }
            ]
        },
        {
            "node_id": "5FMircURTdiSWKuLC2a4zw",
            "node_name": "elasticsearch-0-es-warm-15",
            "transport_address": "10.64.143.4:9300",
            "node_attributes": {
                "k8s_node_name": "gke-eu-cluster-0-e2-gen4-bdb36591-uzmj",
                "warm": "true",
                "xpack.installed": "true",
                "zone": "europe-west1-b",
                "transform.node": "false"
            },
            "node_decision": "no",
            "weight_ranking": 24,
            "deciders": [
                {
                    "decider": "resize",
                    "decision": "NO",
                    "explanation": "source primary is allocated on another node"
                },
                {
                    "decider": "filter",
                    "decision": "NO",
                    "explanation": "initial allocation of the shrunken index is only allowed on nodes [_id:\"KpGc5O6eTpKT5Tr4tTW0uQ\"] that hold a copy of every shard in the index"
                }
            ]
        },
        {
            "node_id": "UJwokT4-Rc-g5Jy87BYTcg",
            "node_name": "elasticsearch-0-es-warm-7",
            "transport_address": "10.64.159.5:9300",
            "node_attributes": {
                "k8s_node_name": "gke-eu-cluster-0-e2-gen4-2898672c-iiky",
                "warm": "true",
                "xpack.installed": "true",
                "zone": "europe-west1-c",
                "transform.node": "false"
            },
            "node_decision": "no",
            "weight_ranking": 29,
            "deciders": [
                {
                    "decider": "filter",
                    "decision": "NO",
                    "explanation": "initial allocation of the shrunken index is only allowed on nodes [_id:\"KpGc5O6eTpKT5Tr4tTW0uQ\"] that hold a copy of every shard in the index"
                }
            ]
        },
        {
            "node_id": "9OGUbZEIRzqOPsRgEWnJQw",
            "node_name": "elasticsearch-0-es-warm-4",
            "transport_address": "10.64.179.4:9300",
            "node_attributes": {
                "k8s_node_name": "gke-eu-cluster-0-e2-gen4-2898672c-3jre",
                "warm": "true",
                "xpack.installed": "true",
                "zone": "europe-west1-c",
                "transform.node": "false"
            },
            "node_decision": "no",
            "weight_ranking": 30,
            "deciders": [
                {
                    "decider": "resize",
                    "decision": "NO",
                    "explanation": "source primary is allocated on another node"
                },
                {
                    "decider": "filter",
                    "decision": "NO",
                    "explanation": "initial allocation of the shrunken index is only allowed on nodes [_id:\"KpGc5O6eTpKT5Tr4tTW0uQ\"] that hold a copy of every shard in the index"
                }
            ]
        },
        {
            "node_id": "Kqu5vRSJQL2_886fTYxOJw",
            "node_name": "elasticsearch-0-es-warm-18",
            "transport_address": "10.64.158.4:9300",
            "node_attributes": {
                "k8s_node_name": "gke-eu-cluster-0-e2-gen4-2898672c-qsr0",
                "warm": "true",
                "xpack.installed": "true",
                "zone": "europe-west1-c",
                "transform.node": "false"
            },
            "node_decision": "no",
            "weight_ranking": 31,
            "deciders": [
                {
                    "decider": "resize",
                    "decision": "NO",
                    "explanation": "source primary is allocated on another node"
                },
                {
                    "decider": "filter",
                    "decision": "NO",
                    "explanation": "initial allocation of the shrunken index is only allowed on nodes [_id:\"KpGc5O6eTpKT5Tr4tTW0uQ\"] that hold a copy of every shard in the index"
                }
            ]
        },
        {
            "node_id": "rcb_e-0BQ_yuMaszbRHeuA",
            "node_name": "elasticsearch-0-es-warm-16",
            "transport_address": "10.64.154.5:9300",
            "node_attributes": {
                "k8s_node_name": "gke-eu-cluster-0-e2-gen4-2898672c-1j4s",
                "warm": "true",
                "xpack.installed": "true",
                "zone": "europe-west1-c",
                "transform.node": "false"
            },
            "node_decision": "no",
            "weight_ranking": 32,
            "deciders": [
                {
                    "decider": "resize",
                    "decision": "NO",
                    "explanation": "source primary is allocated on another node"
                },
                {
                    "decider": "filter",
                    "decision": "NO",
                    "explanation": "initial allocation of the shrunken index is only allowed on nodes [_id:\"KpGc5O6eTpKT5Tr4tTW0uQ\"] that hold a copy of every shard in the index"
                }
            ]
        },
        {
            "node_id": "45jcOptdTjqnA_fG1punmg",
            "node_name": "elasticsearch-0-es-warm-0",
            "transport_address": "10.64.131.7:9300",
            "node_attributes": {
                "k8s_node_name": "gke-eu-cluster-0-e2-gen4-bdb36591-blqz",
                "warm": "true",
                "xpack.installed": "true",
                "zone": "europe-west1-b",
                "transform.node": "false"
            },
            "node_decision": "no",
            "weight_ranking": 33,
            "deciders": [
                {
                    "decider": "resize",
                    "decision": "NO",
                    "explanation": "source primary is allocated on another node"
                },
                {
                    "decider": "filter",
                    "decision": "NO",
                    "explanation": "initial allocation of the shrunken index is only allowed on nodes [_id:\"KpGc5O6eTpKT5Tr4tTW0uQ\"] that hold a copy of every shard in the index"
                }
            ]
        },
        {
            "node_id": "KpGc5O6eTpKT5Tr4tTW0uQ",
            "node_name": "elasticsearch-0-es-warm-6",
            "transport_address": "10.64.142.4:9300",
            "node_attributes": {
                "k8s_node_name": "gke-eu-cluster-0-e2-gen4-bdb36591-94ar",
                "warm": "true",
                "xpack.installed": "true",
                "zone": "europe-west1-b",
                "transform.node": "false"
            },
            "node_decision": "no",
            "weight_ranking": 34,
            "deciders": [
                {
                    "decider": "resize",
                    "decision": "NO",
                    "explanation": "source primary is allocated on another node"
                }
            ]
        },
        {
            "node_id": "enaoon0EQZ2UYUoeSZmaDA",
            "node_name": "elasticsearch-0-es-warm-14",
            "transport_address": "10.64.175.6:9300",
            "node_attributes": {
                "k8s_node_name": "gke-eu-cluster-0-e2-gen4-22399c39-ky6l",
                "warm": "true",
                "xpack.installed": "true",
                "zone": "europe-west1-d",
                "transform.node": "false"
            },
            "node_decision": "no",
            "weight_ranking": 35,
            "deciders": [
                {
                    "decider": "resize",
                    "decision": "NO",
                    "explanation": "source primary is allocated on another node"
                },
                {
                    "decider": "filter",
                    "decision": "NO",
                    "explanation": "initial allocation of the shrunken index is only allowed on nodes [_id:\"KpGc5O6eTpKT5Tr4tTW0uQ\"] that hold a copy of every shard in the index"
                }
            ]
        }
    ]
}
/_cat/shards/v60.agentprocessevent@1m-001820
v60.agentprocessevent@1m-001820 28 p STARTED 4339414 3.5gb 10.64.154.5 elasticsearch-0-es-warm-16
v60.agentprocessevent@1m-001820 28 r STARTED 4339414 3.5gb 10.64.142.4 elasticsearch-0-es-warm-6
v60.agentprocessevent@1m-001820 38 r STARTED 4338675 3.5gb 10.64.161.5 elasticsearch-0-es-warm-10
v60.agentprocessevent@1m-001820 38 p STARTED 4338675 3.5gb 10.64.142.4 elasticsearch-0-es-warm-6
v60.agentprocessevent@1m-001820 41 p STARTED 4338044 3.5gb 10.64.130.4 elasticsearch-0-es-warm-19
v60.agentprocessevent@1m-001820 41 r STARTED 4338044 3.5gb 10.64.142.4 elasticsearch-0-es-warm-6
v60.agentprocessevent@1m-001820 34 r STARTED 4338454 3.5gb 10.64.142.4 elasticsearch-0-es-warm-6
v60.agentprocessevent@1m-001820 34 p STARTED 4338454 3.5gb 10.64.179.4 elasticsearch-0-es-warm-4
v60.agentprocessevent@1m-001820 11 r STARTED 4339328 3.5gb 10.64.130.4 elasticsearch-0-es-warm-19
v60.agentprocessevent@1m-001820 11 p STARTED 4339328 3.5gb 10.64.142.4 elasticsearch-0-es-warm-6
v60.agentprocessevent@1m-001820 9  p STARTED 4339990 3.5gb 10.64.130.4 elasticsearch-0-es-warm-19
v60.agentprocessevent@1m-001820 9  r STARTED 4339990 3.5gb 10.64.142.4 elasticsearch-0-es-warm-6
v60.agentprocessevent@1m-001820 8  p STARTED 4341523 3.5gb 10.64.164.3 elasticsearch-0-es-warm-13
v60.agentprocessevent@1m-001820 8  r STARTED 4341523 3.5gb 10.64.142.4 elasticsearch-0-es-warm-6
v60.agentprocessevent@1m-001820 33 r STARTED 4341562 3.4gb 10.64.175.6 elasticsearch-0-es-warm-14
v60.agentprocessevent@1m-001820 33 p STARTED 4341562 3.4gb 10.64.142.4 elasticsearch-0-es-warm-6
v60.agentprocessevent@1m-001820 29 r STARTED 4337855 3.4gb 10.64.142.4 elasticsearch-0-es-warm-6
v60.agentprocessevent@1m-001820 29 p STARTED 4337855 3.4gb 10.64.176.5 elasticsearch-0-es-warm-11
v60.agentprocessevent@1m-001820 14 r STARTED 4335073 3.5gb 10.64.142.4 elasticsearch-0-es-warm-6
v60.agentprocessevent@1m-001820 14 p STARTED 4335073 3.5gb 10.64.176.5 elasticsearch-0-es-warm-11
v60.agentprocessevent@1m-001820 2  r STARTED 4340165 3.5gb 10.64.158.4 elasticsearch-0-es-warm-18
v60.agentprocessevent@1m-001820 2  p STARTED 4340165 3.5gb 10.64.142.4 elasticsearch-0-es-warm-6
v60.agentprocessevent@1m-001820 36 r STARTED 4341766 3.5gb 10.64.169.6 elasticsearch-0-es-warm-5
v60.agentprocessevent@1m-001820 36 p STARTED 4341766 3.5gb 10.64.142.4 elasticsearch-0-es-warm-6
v60.agentprocessevent@1m-001820 13 p STARTED 4336710 3.5gb 10.64.171.5 elasticsearch-0-es-warm-8
v60.agentprocessevent@1m-001820 13 r STARTED 4336710 3.5gb 10.64.142.4 elasticsearch-0-es-warm-6
v60.agentprocessevent@1m-001820 15 r STARTED 4336604 3.5gb 10.64.154.5 elasticsearch-0-es-warm-16
v60.agentprocessevent@1m-001820 15 p STARTED 4336604 3.5gb 10.64.142.4 elasticsearch-0-es-warm-6
v60.agentprocessevent@1m-001820 35 r STARTED 4336155 3.4gb 10.64.164.3 elasticsearch-0-es-warm-13
v60.agentprocessevent@1m-001820 35 p STARTED 4336155 3.4gb 10.64.142.4 elasticsearch-0-es-warm-6
v60.agentprocessevent@1m-001820 10 r STARTED 4339246 3.5gb 10.64.164.3 elasticsearch-0-es-warm-13
v60.agentprocessevent@1m-001820 10 p STARTED 4339246 3.5gb 10.64.142.4 elasticsearch-0-es-warm-6
v60.agentprocessevent@1m-001820 37 r STARTED 4343104 3.5gb 10.64.158.4 elasticsearch-0-es-warm-18
v60.agentprocessevent@1m-001820 37 p STARTED 4343104 3.5gb 10.64.142.4 elasticsearch-0-es-warm-6
v60.agentprocessevent@1m-001820 16 p STARTED 4337849 3.5gb 10.64.175.6 elasticsearch-0-es-warm-14
v60.agentprocessevent@1m-001820 16 r STARTED 4337849 3.5gb 10.64.142.4 elasticsearch-0-es-warm-6
v60.agentprocessevent@1m-001820 5  r STARTED 4339234 3.5gb 10.64.175.6 elasticsearch-0-es-warm-14
v60.agentprocessevent@1m-001820 5  p STARTED 4339234 3.5gb 10.64.142.4 elasticsearch-0-es-warm-6
v60.agentprocessevent@1m-001820 1  p STARTED 4339507 3.5gb 10.64.174.5 elasticsearch-0-es-warm-2
v60.agentprocessevent@1m-001820 1  r STARTED 4339507 3.5gb 10.64.142.4 elasticsearch-0-es-warm-6
v60.agentprocessevent@1m-001820 21 p STARTED 4337037 3.5gb 10.64.182.6 elasticsearch-0-es-warm-17
v60.agentprocessevent@1m-001820 21 r STARTED 4337037 3.5gb 10.64.142.4 elasticsearch-0-es-warm-6
v60.agentprocessevent@1m-001820 20 r STARTED 4337287 3.5gb 10.64.174.5 elasticsearch-0-es-warm-2
v60.agentprocessevent@1m-001820 20 p STARTED 4337287 3.5gb 10.64.142.4 elasticsearch-0-es-warm-6
v60.agentprocessevent@1m-001820 17 r STARTED 4337103 3.5gb 10.64.142.4 elasticsearch-0-es-warm-6
v60.agentprocessevent@1m-001820 17 p STARTED 4337103 3.5gb 10.64.176.5 elasticsearch-0-es-warm-11
v60.agentprocessevent@1m-001820 26 p STARTED 4338520 3.5gb 10.64.154.5 elasticsearch-0-es-warm-16
v60.agentprocessevent@1m-001820 26 r STARTED 4338520 3.5gb 10.64.142.4 elasticsearch-0-es-warm-6
v60.agentprocessevent@1m-001820 4  r STARTED 4339203 3.5gb 10.64.175.6 elasticsearch-0-es-warm-14
v60.agentprocessevent@1m-001820 4  p STARTED 4339203 3.5gb 10.64.142.4 elasticsearch-0-es-warm-6
v60.agentprocessevent@1m-001820 7  p STARTED 4336603 3.5gb 10.64.159.5 elasticsearch-0-es-warm-7
v60.agentprocessevent@1m-001820 7  r STARTED 4336603 3.5gb 10.64.142.4 elasticsearch-0-es-warm-6
v60.agentprocessevent@1m-001820 27 r STARTED 4337862 3.5gb 10.64.175.6 elasticsearch-0-es-warm-14
v60.agentprocessevent@1m-001820 27 p STARTED 4337862 3.5gb 10.64.142.4 elasticsearch-0-es-warm-6
v60.agentprocessevent@1m-001820 19 p STARTED 4337474 3.5gb 10.64.142.4 elasticsearch-0-es-warm-6
v60.agentprocessevent@1m-001820 19 r STARTED 4337474 3.5gb 10.64.176.5 elasticsearch-0-es-warm-11
v60.agentprocessevent@1m-001820 3  p STARTED 4336283 3.5gb 10.64.169.6 elasticsearch-0-es-warm-5
v60.agentprocessevent@1m-001820 3  r STARTED 4336283 3.5gb 10.64.142.4 elasticsearch-0-es-warm-6
v60.agentprocessevent@1m-001820 12 p STARTED 4341266 3.5gb 10.64.159.5 elasticsearch-0-es-warm-7
v60.agentprocessevent@1m-001820 12 r STARTED 4341266 3.5gb 10.64.142.4 elasticsearch-0-es-warm-6
v60.agentprocessevent@1m-001820 40 p STARTED 4335945 3.5gb 10.64.159.5 elasticsearch-0-es-warm-7
v60.agentprocessevent@1m-001820 40 r STARTED 4335945 3.5gb 10.64.142.4 elasticsearch-0-es-warm-6
v60.agentprocessevent@1m-001820 18 r STARTED 4335919 3.5gb 10.64.142.4 elasticsearch-0-es-warm-6
v60.agentprocessevent@1m-001820 18 p STARTED 4335919 3.5gb 10.64.179.4 elasticsearch-0-es-warm-4
v60.agentprocessevent@1m-001820 24 p STARTED 4340815 3.5gb 10.64.169.6 elasticsearch-0-es-warm-5
v60.agentprocessevent@1m-001820 24 r STARTED 4340815 3.5gb 10.64.142.4 elasticsearch-0-es-warm-6
v60.agentprocessevent@1m-001820 31 p STARTED 4334595 3.4gb 10.64.161.5 elasticsearch-0-es-warm-10
v60.agentprocessevent@1m-001820 31 r STARTED 4334595 3.4gb 10.64.142.4 elasticsearch-0-es-warm-6
v60.agentprocessevent@1m-001820 32 r STARTED 4339671 3.5gb 10.64.175.6 elasticsearch-0-es-warm-14
v60.agentprocessevent@1m-001820 32 p STARTED 4339671 3.5gb 10.64.142.4 elasticsearch-0-es-warm-6
v60.agentprocessevent@1m-001820 22 r STARTED 4337912 3.5gb 10.64.174.5 elasticsearch-0-es-warm-2
v60.agentprocessevent@1m-001820 22 p STARTED 4337912 3.5gb 10.64.142.4 elasticsearch-0-es-warm-6
v60.agentprocessevent@1m-001820 25 p STARTED 4337102 3.5gb 10.64.130.4 elasticsearch-0-es-warm-19
v60.agentprocessevent@1m-001820 25 r STARTED 4337102 3.5gb 10.64.142.4 elasticsearch-0-es-warm-6
v60.agentprocessevent@1m-001820 39 r STARTED 4340482 3.5gb 10.64.171.5 elasticsearch-0-es-warm-8
v60.agentprocessevent@1m-001820 39 p STARTED 4340482 3.5gb 10.64.142.4 elasticsearch-0-es-warm-6
v60.agentprocessevent@1m-001820 23 p STARTED 4338374 3.5gb 10.64.142.4 elasticsearch-0-es-warm-6
v60.agentprocessevent@1m-001820 23 r STARTED 4338374 3.5gb 10.64.179.4 elasticsearch-0-es-warm-4
v60.agentprocessevent@1m-001820 30 r STARTED 4335923 3.5gb 10.64.152.5 elasticsearch-0-es-warm-1
v60.agentprocessevent@1m-001820 30 p STARTED 4335923 3.5gb 10.64.142.4 elasticsearch-0-es-warm-6
v60.agentprocessevent@1m-001820 6  r STARTED 4336619 3.5gb 10.64.154.5 elasticsearch-0-es-warm-16
v60.agentprocessevent@1m-001820 6  p STARTED 4336619 3.5gb 10.64.142.4 elasticsearch-0-es-warm-6
v60.agentprocessevent@1m-001820 0  r STARTED 4337113 3.5gb 10.64.130.4 elasticsearch-0-es-warm-19
v60.agentprocessevent@1m-001820 0  p STARTED 4337113 3.5gb 10.64.142.4 elasticsearch-0-es-warm-6

I tried manually rerouting primary shards to es-warm-6; however, the cluster reroute failed because that node already held an active replica, and when I tried to reroute that replica instead, it wasn't allowed due to the require._id allocation setting set by shrink.
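
For illustration, the kind of _cluster/reroute move command this involves looks roughly like this (the shard number and node names are taken from the _cat/shards output above purely as an example; the exact commands I ran may have differed):

POST _cluster/reroute
{
  "commands": [
    {
      "move": {
        "index": "v60.agentprocessevent@1m-001820",
        "shard": 29,
        "from_node": "elasticsearch-0-es-warm-11",
        "to_node": "elasticsearch-0-es-warm-6"
      }
    }
  ]
}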

I then tried restarting the shrink by deleting the (currently broken) target index, removing the require._id allocation setting, and manually moving ILM to the set-single-node-allocation step. This picked a new node for the shrink and shuffled all the shards onto it as normal; however, when it tried to shrink, it hit exactly the same error. In the end I just removed the ILM policy from this index and deleted the broken target in order to restore the cluster to green health.
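
Sketched out as API calls, that retry looked roughly like this (index names as above; the current_step values for _ilm/move are only illustrative, since they depend on the step ILM is actually stuck on, which GET <index>/_ilm/explain reports):

DELETE shrink-c0tp-v60.agentprocessevent@1m-001820

PUT v60.agentprocessevent@1m-001820/_settings
{
  "index.routing.allocation.require._id": null
}

POST _ilm/move/v60.agentprocessevent@1m-001820
{
  "current_step": {
    "phase": "warm",
    "action": "shrink",
    "name": "shrink"
  },
  "next_step": {
    "phase": "warm",
    "action": "shrink",
    "name": "set-single-node-allocation"
  }
}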

I've had a look at the code and found that this error comes from the ResizeAllocationDecider. If I understand the logic correctly, this seems wrong: a shrink operation only requires a copy of each shard of the index to be present on the target node, not necessarily a primary shard. But I'm particularly confused by why this isn't encountered more frequently (this is quite a large and busy cluster, and it does a lot of rollovers and shrinks).

Finally got to the bottom of the issue! It's quite convoluted, but the key is that this error only happens if the resized index ends up with the same number of shards as the source. That can happen for a few different reasons (in our case we were hitting a bug that was fixed in v7.17.5), but in the case of the original post I think you have set rollover and shrink to the same size, meaning that after rollover the number of shards ILM tries to shrink the index to is the same as the source. In order to actually shrink, the number of shards in the source index must be an integer multiple of the desired number of shards in the target (e.g. an index with 6 shards can be shrunk to 1, 2, or 3 shards, but there's no point shrinking at all if the shard count stays the same). If you have 3 shards at ingest, then to end up with a target shard size of 50 GB after the shrink you'd want them to be around 15 GB at rollover (leave a little slack here, otherwise the index won't be able to shrink if it goes slightly over the rollover limit).
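
As a sketch of that sizing (assuming 3 primary shards at ingest and a placeholder policy name), rolling over at roughly 15 GB per primary shard lets the warm-phase shrink collapse the 3 shards into a single shard of about 45 GB, comfortably under the 50 GB target:

PUT _ilm/policy/my-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_primary_shard_size": "15gb",
            "max_age": "7d"
          }
        }
      },
      "warm": {
        "actions": {
          "shrink": {
            "max_primary_shard_size": "50gb"
          }
        }
      }
    }
  }
}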

In a more recent ES version, ILM has been fixed so that it does nothing if the source and target indices would have the same number of shards, so you wouldn't hit this issue there. However, it seems that using the shrink API directly is still unsafe and could lead to this bug; I'd need to test on a newer ES version.
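
For completeness, the direct shrink API referred to here looks like this (index names are placeholders; the requested number_of_shards must be a factor of the source index's shard count, and the allocation filter and write block copied from the source are cleared in the same request):

POST /my-source-index/_shrink/my-target-index
{
  "settings": {
    "index.number_of_shards": 1,
    "index.routing.allocation.require._id": null,
    "index.blocks.write": null
  }
}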

Yes, we had set it up to shrink to the same number of shards as before shrinking. So our workaround was to not shrink at all, and to only apply "read only" when rolling over to the warm phase.
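
In policy terms, that workaround would look roughly like this for the warm phase (a sketch only; readonly is the built-in ILM action that marks the index read-only, and the other phases stay as before):

"warm": {
  "min_age": "0ms",
  "actions": {
    "readonly": {}
  }
}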
