ILM stuck at check-allocation on transition to Cold Node

Hi folks,

We are having an issue where our index lifecycle policy never completes the check-allocation step when moving indices to the cold nodes. It doesn't report any errors; it just waits and never completes. Interestingly, we don't have the same issue with the hot-to-warm transition, only with warm-to-cold. I'm at a loss as to why this is getting stuck, and I can't find any errors in the logs.

We are running Elasticsearch 7.9.1 using the ECK operator version 1.3.0.

GET _ilm/policy/companyname-prod-us-global
{
  "companyname-prod-us-global" : {
    "version" : 11,
    "modified_date" : "2021-01-06T17:58:03.603Z",
    "policy" : {
      "phases" : {
        "warm" : {
          "min_age" : "0ms",
          "actions" : {
            "allocate" : {
              "number_of_replicas" : 0,
              "include" : { },
              "exclude" : { },
              "require" : {
                "data" : "warm"
              }
            },
            "set_priority" : {
              "priority" : 50
            }
          }
        },
        "cold" : {
          "min_age" : "30d",
          "actions" : {
            "allocate" : {
              "number_of_replicas" : 0,
              "include" : { },
              "exclude" : { },
              "require" : {
                "data" : "cold"
              }
            },
            "set_priority" : {
              "priority" : 20
            }
          }
        },
        "hot" : {
          "min_age" : "0ms",
          "actions" : {
            "rollover" : {
              "max_size" : "50gb",
              "max_age" : "1d"
            },
            "set_priority" : {
              "priority" : 100
            }
          }
        },
        "delete" : {
          "min_age" : "90d",
          "actions" : {
            "delete" : {
              "delete_searchable_snapshot" : true
            }
          }
        }
      }
    }
  }
}
GET nginx-2020.11.27-000002/_ilm/explain
{
  "indices" : {
    "nginx-2020.11.27-000002" : {
      "index" : "nginx-2020.11.27-000002",
      "managed" : true,
      "policy" : "companyname-prod-us-global",
      "lifecycle_date_millis" : 1607186458617,
      "age" : "33.33d",
      "phase" : "cold",
      "phase_time_millis" : 1608266450939,
      "action" : "allocate",
      "action_time_millis" : 1608266451951,
      "step" : "check-allocation",
      "step_time_millis" : 1608266454513,
      "step_info" : {
        "message" : "Waiting for [1] shards to be allocated to nodes matching the given filters",
        "shards_left_to_allocate" : 1,
        "all_shards_active" : true,
        "actual_replicas" : 0
      },
      "phase_execution" : {
        "policy" : "clio-prod-us-global",
        "phase_definition" : {
          "min_age" : "30d",
          "actions" : {
            "allocate" : {
              "number_of_replicas" : 0,
              "include" : { },
              "exclude" : { },
              "require" : {
                "data" : "cold"
              }
            },
            "set_priority" : {
              "priority" : 20
            }
          }
        },
        "version" : 11,
        "modified_date_in_millis" : 1609955883603
      }
    }
  }
}
GET _nodes/*cold*/settings | jq '.nodes | to_entries[] | {name: .value.name, attributes: .value.attributes}'
{
  "name": "prod-es-cold-001-1",
  "attributes": {
    "k8s_node_name": "ip-10-110-137-169.ec2.internal",
    "ml.machine_memory": "6442450944",
    "ml.max_open_jobs": "20",
    "xpack.installed": "true",
    "data": "cold",
    "transform.node": "true"
  }
}
{
  "name": "prod-es-cold-001-2",
  "attributes": {
    "k8s_node_name": "ip-10-110-130-207.ec2.internal",
    "ml.machine_memory": "6442450944",
    "ml.max_open_jobs": "20",
    "xpack.installed": "true",
    "data": "cold",
    "transform.node": "true"
  }
}
{
  "name": "prod-es-cold-001-0",
  "attributes": {
    "k8s_node_name": "ip-10-110-145-243.ec2.internal",
    "ml.machine_memory": "6442450944",
    "ml.max_open_jobs": "20",
    "xpack.installed": "true",
    "data": "cold",
    "transform.node": "true"
  }
}
GET _cat/allocation?v
shards disk.indices disk.used disk.avail disk.total disk.percent host           ip             node
    32      359.4gb   361.1gb    508.6gb    869.8gb           41 10.110.138.107 10.110.138.107 prod-es-hot-az2-002-1
    51        2.4tb     2.4tb    390.3gb      2.7tb           86 10.110.130.124 10.110.130.124 prod-es-warm-az1-002-0
     0           0b   104.3mb      2.7tb      2.7tb            0 10.110.137.136 10.110.137.136 prod-es-cold-001-1
    32      526.5gb   527.6gb    342.2gb    869.8gb           60 10.110.139.38  10.110.139.38  prod-es-hot-az2-002-0
    31      288.1gb   289.5gb    580.3gb    869.8gb           33 10.110.146.131 10.110.146.131 prod-es-hot-az3-002-1
     0           0b   104.3mb      2.7tb      2.7tb            0 10.110.146.225 10.110.146.225 prod-es-cold-001-0
    32      755.4gb   755.9gb    113.8gb    869.8gb           86 10.110.145.67  10.110.145.67  prod-es-hot-az3-002-0
    76        2.4tb     2.4tb    387.9gb      2.7tb           86 10.110.139.54  10.110.139.54  prod-es-warm-az2-002-0
    32      746.3gb   747.7gb    122.1gb    869.8gb           85 10.110.129.120 10.110.129.120 prod-es-hot-az1-002-1
    32      416.8gb   418.8gb    450.9gb    869.8gb           48 10.110.128.195 10.110.128.195 prod-es-hot-az1-002-0
     0           0b   104.3mb      2.7tb      2.7tb            0 10.110.129.48  10.110.129.48  prod-es-cold-001-2
    44        2.4tb     2.4tb    393.1gb      2.7tb           86 10.110.146.45  10.110.146.45  prod-es-warm-az3-002-0

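Well, after all that, we finally figured it out. We weren't exposing the zone attributes on the cold nodes, while the earlier phases of the indices (hot and warm) were zone-routing aware. Since the cold nodes had no zone attribute, the shards could not be allocated to them and check-allocation kept waiting.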
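
For anyone who lands here with the same symptom, here is a rough sketch of the requests that would probably have pointed at this sooner. These are standard Elasticsearch APIs, but the details are assumptions on my part: the index name is taken from the explain output above, shard 0 is a guess (adjust to whichever shard is still waiting), and I'm assuming the zone routing is configured through the cluster-level allocation awareness settings.

```
# Ask the cluster why the shard cannot move. The response lists, per node,
# the allocation deciders that rejected it.
# Assumption: shard 0 is the one still waiting.
GET _cluster/allocation/explain
{
  "index": "nginx-2020.11.27-000002",
  "shard": 0,
  "primary": true
}

# List the custom attributes each node exposes, to compare the cold nodes
# against the hot and warm nodes.
GET _cat/nodeattrs?v&h=node,attr,value

# Show the effective cluster settings, then look for
# cluster.routing.allocation.awareness.attributes in the output.
GET _cluster/settings?include_defaults=true&flat_settings=true
```

If an awareness attribute is configured and the cold nodes don't list it in the nodeattrs output, that mismatch is the likely culprit. The per-node decisions in the allocation explain response should confirm it.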
