ILM stalled waiting for shard copies to be active

Hello!

I have an ILM policy configured, but it stops at the rollover phase.

I checked one of the indices with this query:

GET /apm-7.10.0-error-000001/_ilm/explain?human

And the response is:

{
  "indices" : {
    "apm-7.10.0-error-000001" : {
      "index" : "apm-7.10.0-error-000001",
      "managed" : true,
      "policy" : "apm-rollover-30-days",
      "lifecycle_date" : "2021-01-13T22:25:52.555Z",
      "lifecycle_date_millis" : 1610576752555,
      "age" : "181.36d",
      "phase" : "warm",
      "phase_time" : "2021-02-12T22:34:14.211Z",
      "phase_time_millis" : 1613169254211,
      "action" : "migrate",
      "action_time" : "2021-02-12T22:35:00.176Z",
      "action_time_millis" : 1613169300176,
      "step" : "check-migration",
      "step_time" : "2021-02-12T22:35:17.987Z",
      "step_time_millis" : 1613169317987,
      "step_info" : {
        "message" : "Waiting for all shard copies to be active",
        "shards_left_to_allocate" : -1,
        "all_shards_active" : false,
        "number_of_replicas" : 1
      },
      "phase_execution" : {
        "policy" : "apm-rollover-30-days",
        "phase_definition" : {
          "min_age" : "14d",
          "actions" : {
            "readonly" : { },
            "set_priority" : {
              "priority" : 50
            }
          }
        },
        "version" : 3,
        "modified_date" : "2021-05-27T23:52:01.125Z",
        "modified_date_in_millis" : 1622159521125
      }
    }
  }
}

Then I ran this query:

GET _cluster/allocation/explain

And I received this response:

{
  "index" : "apm-7.10.0-span-000009",
  "shard" : 0,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "CLUSTER_RECOVERED",
    "at" : "2021-07-06T16:01:24.109Z",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions" : [
    {
      "node_id" : "jwbdx5IzTfeqlA3r7Rnmlg",
      "node_name" : "b6be8b05ea96",
      "transport_address" : "172.18.0.4:9300",
      "node_attributes" : {
        "ml.machine_memory" : "3221225472",
        "xpack.installed" : "true",
        "transform.node" : "true",
        "ml.max_open_jobs" : "20"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "same_shard",
          "decision" : "NO",
          "explanation" : "a copy of this shard is already allocated to this node [[apm-7.10.0-span-000009][0], node[jwbdx5IzTfeqlA3r7Rnmlg], [P], s[STARTED], a[id=RF-1z_7OQxeixisbtoe06Q]]"
        }
      ]
    }
  ]
}

Can you help me understand what I am looking at, and why my index lifecycle policy is not working? Could it be related to not having enough disk space for all the shards?

Thanks!

Hey,

Very high-level guess from the data: are you running a single node, but have your shards configured with a replica?
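You can check this quickly in Dev Tools (the index name below is taken from your allocation explain output):

GET _cat/nodes?v
GET _cat/shards/apm-7.10.0-span-000009?v

If the first call returns a single node and the second lists replica shards (marked r) as UNASSIGNED, that is the cause: Elasticsearch never allocates a replica to the same node that already holds the primary, which is exactly what the same_shard decider in your output is saying.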

--Alex

Indeed, that's the case! I wasn't aware that there can be only one copy of a shard per node. The decider explanation makes sense now.

I think in my case it will be better to have no replica shards. Will changing the configuration of the indices be enough to solve the issue, or should I do something else to handle those unallocated shards?

Changing the configuration on the indices is enough for the ILM policy to proceed. However, you probably also want future index creations to work, so you would need to adapt the index template's number of replicas as well.
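For the existing indices, something like this drops the replicas (the wildcard pattern is an assumption based on your index names, so adjust it to your setup):

PUT /apm-7.10.0-*/_settings
{
  "index" : {
    "number_of_replicas" : 0
  }
}

The stuck check-migration step is a waiting step, so ILM should pick the change up on its next poll (every 10 minutes by default) without further action. For future indices, find the matching template with GET _template/apm* and set "number_of_replicas" : 0 in its settings too.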
