ILM Indices Blocking Master Queue - All Operations Timeout

Our ES cluster has had a blocked master queue for 12+ days. All cluster state change operations time out after 30s, including:

  • Index open operations

  • Index delete operations

  • ILM move commands

  • ILM stop/start commands

Three closed ILM system indices (deprecation logs and ILM history) have stuck ILM retry tasks at the front of the master queue (positions 1-3), blocking all subsequent operations:

.ds-ilm-history-5-2024.02.02-000001 (CLOSED, 787 days old)
└─ ILM history tracking index (system-generated)

.ds-.logs-deprecation.elasticsearch-default-2024.02.02-000001 (CLOSED, 787 days old)
└─ Elasticsearch deprecation logs (system-generated)

.ds-.logs-deprecation.elasticsearch-default-2024.02.02-000026 (CLOSED, 787 days old)
└─ Elasticsearch deprecation logs (system-generated)

These ILM system indices have policies configured for traditional indices (expecting a rollover_alias), but they are data stream backing indices (.ds- prefix), which don't use aliases. ILM fails with:

setting [index.lifecycle.rollover_alias] for index [...] is empty or not defined

Policies involved:

  • .deprecation-indexing-ilm-policy (for deprecation log data streams)

  • ilm-history-ilm-policy (for ILM history data streams)

Both policies have rollover configurations expecting alias-based indices rather than data streams.
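
The failure mode can be illustrated with a small sketch. This is a hypothetical simplification of ILM's check-rollover-ready step, not the actual Elasticsearch source (real ILM distinguishes data-stream-backed indices from alias-based ones via cluster metadata):

```python
# Hypothetical simplification of ILM's check-rollover-ready step.
# When an index is treated as alias-based, the step requires the
# index.lifecycle.rollover_alias setting; a data stream backing
# index (.ds- prefix) never has one, so the check always fails.

def check_rollover_ready(index_name: str, settings: dict) -> str:
    alias = settings.get("index.lifecycle.rollover_alias")
    if not alias:
        # This matches the exact error reported in the ILM explain output.
        raise ValueError(
            f"setting [index.lifecycle.rollover_alias] for index "
            f"[{index_name}] is empty or not defined"
        )
    return alias

# Settings of the stuck system index (no rollover_alias present):
settings = {
    "index.lifecycle.name": ".deprecation-indexing-ilm-policy",
    "index.hidden": "true",
}
try:
    check_rollover_ready(
        ".ds-.logs-deprecation.elasticsearch-default-2024.02.02-000026",
        settings,
    )
except ValueError as e:
    print(e)
```

Since the error is classified as auto-retryable, ILM keeps re-queueing the same failing step, which is how the retry count grows unbounded.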

Pending Tasks Queue

{
  "insert_order": 78364,
  "source": "ilm-retry-failed-step {policy [.deprecation-indexing-ilm-policy], index [.ds-.logs-deprecation.elasticsearch-default-2024.02.02-000026], failedStep [{\"phase\":\"hot\",\"action\":\"rollover\",\"name\":\"check-rollover-ready\"}]}",
  "time_in_queue": "12.4d"
},
{
  "insert_order": 78365,
  "source": "ilm-retry-failed-step {policy [.deprecation-indexing-ilm-policy], index [.ds-.logs-deprecation.elasticsearch-default-2024.02.02-000001], failedStep [{\"phase\":\"hot\",\"action\":\"rollover\",\"name\":\"check-rollover-ready\"}]}",
  "time_in_queue": "12.4d"
},
{
  "insert_order": 78366,
  "source": "ilm-retry-failed-step {policy [ilm-history-ilm-policy], index [.ds-ilm-history-5-2024.02.02-000001], failedStep [{\"phase\":\"hot\",\"action\":\"rollover\",\"name\":\"check-rollover-ready\"}]}",
  "time_in_queue": "12.4d"
},

Additional affected ILM system indices (OPEN, but blocked by the closed ones):

.ds-ilm-history-5-2025.10.24-000066 (OPEN, empty - 0 docs)
.kibana-event-log-7.17.5-000012 (OPEN, empty - 0 docs)
.ds-.logs-deprecation.elasticsearch-default-2026.03.08-000076 (OPEN, 1 doc)

What We've Tried

1. ILM Move API on Closed Indices (Failed)
POST /_ilm/move/.ds-.logs-deprecation.elasticsearch-default-2024.02.02-000026

{
  "current_step": {"phase": "hot", "action": "rollover", "name": "check-rollover-ready"},
  "next_step": {"phase": "hot", "action": "complete", "name": "complete"}
}

Result: index_closed_exception - Can't read ILM state from closed index

2. ILM Move API on Open Indices (Failed)

3. Open Closed Indices (Failed)


{
 "error": {
    "type": "process_cluster_event_timeout_exception",
    "reason": "failed to process cluster event (open-indices [...]) within 30s"
  },
  "status": 503
}

4. Delete Closed Indices (Failed)


{
  "error": {
    "type": "process_cluster_event_timeout_exception",
    "reason": "failed to process cluster event (delete-index [...]) within 30s"
  },
  "status": 503
}

Index Settings (one of the problematic ILM system indices):


{
  "index.lifecycle.name": ".deprecation-indexing-ilm-policy",
  "index.verified_before_close": "true",
  "index.hidden": "true",
  "index.number_of_shards": "1",
  "index.number_of_replicas": "1"
}

Note: No index.lifecycle.rollover_alias setting (because it's a data stream index).

ILM Explain (showing 53,779 failed retry attempts):

{
  "index": ".ds-.logs-deprecation.elasticsearch-default-2024.02.02-000026",
  "managed": true,
  "policy": ".deprecation-indexing-ilm-policy",
  "step": "ERROR",
  "failed_step": "check-rollover-ready",
  "is_auto_retryable_error": true,
  "failed_step_retry_count": 53779,
  "step_info": {
    "type": "illegal_argument_exception",
    "reason": "setting [index.lifecycle.rollover_alias] for index [.ds-.logs-deprecation.elasticsearch-default-2024.02.02-000026] is empty or not defined"
  }
}
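
The retry count is consistent with roughly a year of continuous failures. A back-of-the-envelope check, assuming the default indices.lifecycle.poll_interval of 10 minutes (this cluster may override it):

```python
# Rough sanity check on failed_step_retry_count = 53,779, assuming
# ILM's default poll interval of 10 minutes. If the cluster overrides
# indices.lifecycle.poll_interval, the result scales accordingly.
POLL_INTERVAL_MINUTES = 10  # default; an assumption for this cluster

def retries_to_days(retry_count: int, interval_min: int = POLL_INTERVAL_MINUTES) -> float:
    return retry_count * interval_min / (60 * 24)

days = retries_to_days(53_779)
print(f"~{days:.0f} days of retries")  # roughly a year of failed attempts
```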

Questions

  1. How can we break this deadlock when even DELETE operations timeout?

  2. Would a master node restart help in this case? If so, how do we prevent the same ILM system index tasks from re-queueing on startup?

  3. Can we bypass the master queue for delete operations on ILM system indices, or is there a force-delete mechanism?

  4. Why were the ILM system indices (.logs-deprecation, ilm-history) configured with policies expecting rollover_alias?

  5. Why do closed system indices block the entire master queue indefinitely? Shouldn't the master skip/fail closed index operations quickly rather than blocking?

Impact

  • ILM Policy Updates: Cannot update ILM policies for business indices (5min timeout)

  • Cluster Administration: All cluster state changes blocked

  • Business Operations: Unaffected (read/write to business indices works fine)

  • Risk: Queue continues growing; cluster may eventually become unstable

Additional Context

  • Cluster health is GREEN, all data nodes healthy

  • Only master queue operations are affected

  • The 3 closed system indices are 787-day-old logs

  • Current ES version: 8.14.1

  • Earlier, in 2024 when those indices were created, we were running ES 7.17

It looks like you have deleted important fields from this response, making it impossible to answer your question. The expected fields come from here:

This version is pretty old. Please upgrade to at least the latest 8.x version ASAP. If the problem persists, we can dig deeper, but for now I suspect you’re hitting a bug that has since been fixed.

@DavidTurner Thank you for looking into this. I am sharing the complete response of the pending_tasks API:

 {
      "insert_order" : 78364,
      "priority" : "NORMAL",
      "source" : "ilm-retry-failed-step {policy [.deprecation-indexing-ilm-policy], index [.ds-.logs-deprecation.elasticsearch-default-2024.02.02-000026], failedStep [{\"phase\":\"hot\",\"action\":\"rollover\",\"name\":\"check-rollover-ready\"}]}",
      "executing" : false,
      "time_in_queue_millis" : 1098701277,
      "time_in_queue" : "12.7d"
    },
    {
      "insert_order" : 78365,
      "priority" : "NORMAL",
      "source" : "ilm-retry-failed-step {policy [.deprecation-indexing-ilm-policy], index [.ds-.logs-deprecation.elasticsearch-default-2024.02.02-000001], failedStep [{\"phase\":\"hot\",\"action\":\"rollover\",\"name\":\"check-rollover-ready\"}]}",
      "executing" : false,
      "time_in_queue_millis" : 1098701277,
      "time_in_queue" : "12.7d"
    },
    {
      "insert_order" : 78366,
      "priority" : "NORMAL",
      "source" : "ilm-retry-failed-step {policy [ilm-history-ilm-policy], index [.ds-ilm-history-5-2024.02.02-000001], failedStep [{\"phase\":\"hot\",\"action\":\"rollover\",\"name\":\"check-rollover-ready\"}]}",
      "executing" : false,
      "time_in_queue_millis" : 1098701277,
      "time_in_queue" : "12.7d"
    },
    {
      "insert_order" : 78367,
      "priority" : "NORMAL",
      "source" : "ilm-set-step-info {policy [ilm-history-ilm-policy], index [.ds-ilm-history-5-2025.10.24-000066], currentStep [{\"phase\":\"hot\",\"action\":\"rollover\",\"name\":\"check-rollover-ready\"}]}",
      "executing" : false,
      "time_in_queue_millis" : 1098701277,
      "time_in_queue" : "12.7d"
    },
    {
      "insert_order" : 78368,
      "priority" : "NORMAL",
      "source" : "ilm-set-step-info {policy [kibana-event-log-policy], index [.kibana-event-log-7.17.5-000012], currentStep [{\"phase\":\"hot\",\"action\":\"rollover\",\"name\":\"check-rollover-ready\"}]}",
      "executing" : false,
      "time_in_queue_millis" : 1098701277,
      "time_in_queue" : "12.7d"
    },
    {
      "insert_order" : 78369,
      "priority" : "NORMAL",
      "source" : "ilm-set-step-info {policy [.deprecation-indexing-ilm-policy], index [.ds-.logs-deprecation.elasticsearch-default-2026.03.08-000076], currentStep [{\"phase\":\"hot\",\"action\":\"rollover\",\"name\":\"check-rollover-ready\"}]}",
      "executing" : false,
      "time_in_queue_millis" : 1098701277,
      "time_in_queue" : "12.7d"
    },
    {
      "insert_order" : 93402,
      "priority" : "NORMAL",
      "source" : "rollover_index source [.ds-ilm-history-7-2026.03.24-000037] to target [.ds-ilm-history-7-2026.03.24-000037]",
      "executing" : false,
      "time_in_queue_millis" : 578212374,
      "time_in_queue" : "6.6d"
    }

Two more tasks (delete-index operations) were added to the queue recently:

 {
      "insert_order" : 87633,
      "priority" : "URGENT",
      "source" : "delete-index [[.ds-ilm-history-7-2025.12.16-000012/u_FwAkCIT6O24jHVzuV3-g]]",
      "executing" : false,
      "time_in_queue_millis" : 667310608,
      "time_in_queue" : "7.7d"
    },
    {
      "insert_order" : 173265,
      "priority" : "URGENT",
      "source" : "delete-index [[.ds-ilm-history-7-2025.12.23-000013/3qIqfiKtQ8Oc1X7XVk4HBQ]]",
      "executing" : false,
      "time_in_queue_millis" : 62223523,
      "time_in_queue" : "17.2h"
    },
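
Notably, these delete-index tasks carry URGENT priority yet are still waiting. Priority only reorders tasks that have not yet started; the single cluster-state update thread cannot preempt whatever is currently running. A simplified model of that ordering (a sketch, not the actual MasterService implementation):

```python
import heapq

# Simplified model of the master task queue: waiting tasks are ordered
# by (priority rank, insert_order), but a task already running on the
# single cluster-state thread cannot be preempted.
PRIORITY_RANK = {"URGENT": 0, "HIGH": 1, "NORMAL": 2}

def queue_order(tasks):
    """Return task sources in the order the master would pick them up."""
    heap = [
        (PRIORITY_RANK[t["priority"]], t["insert_order"], t["source"])
        for t in tasks
    ]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]

tasks = [
    {"insert_order": 78364, "priority": "NORMAL",
     "source": "ilm-retry-failed-step ..."},
    {"insert_order": 87633, "priority": "URGENT",
     "source": "delete-index [.ds-ilm-history-7-2025.12.16-000012]"},
]
print(queue_order(tasks))  # the URGENT delete sorts first among waiting tasks
```

If the URGENT deletes are still stuck after 7+ days, that suggests the head of the queue is not merely slow but genuinely wedged.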

If you need any other details, please let me know and I can share them.

We can’t upgrade the version right away; is there any workaround for this problem?

Also, here is the ILM explain API output for one of the indices:

Full ILM Explain Output (showing 53,779 failed retry attempts):

{
  "indices": {
    ".ds-.logs-deprecation.elasticsearch-default-2024.02.02-000026": {
      "index": ".ds-.logs-deprecation.elasticsearch-default-2024.02.02-000026",
      "managed": true,
      "policy": ".deprecation-indexing-ilm-policy",
      "index_creation_date_millis": 1706875024155,
      "time_since_index_creation": "787.87d",
      "lifecycle_date_millis": 1706875024155,
      "age": "787.87d",
      "phase": "hot",
      "phase_time_millis": 1773871448834,
      "action": "rollover",
      "action_time_millis": 1706875025202,
      "step": "ERROR",
      "step_time_millis": 1773872048904,
      "failed_step": "check-rollover-ready",
      "is_auto_retryable_error": true,
      "failed_step_retry_count": 53779,
      "step_info": {
        "type": "illegal_argument_exception",
        "reason": "setting [index.lifecycle.rollover_alias] for index [.ds-.logs-deprecation.elasticsearch-default-2024.02.02-000026] is empty or not defined"
      },
      "phase_execution": {
        "policy": ".deprecation-indexing-ilm-policy",
        "phase_definition": {
          "min_age": "0ms",
          "actions": {
            "rollover": {
              "max_age": "30d",
              "min_docs": 1,
              "max_primary_shard_docs": 200000000,
              "max_primary_shard_size": "10gb"
            }
          }
        },
        "version": 1,
        "modified_date_in_millis": 1673159414744
      }
    }
  }
}

Looks like a bug we’ve fixed since 8.14 indeed. Restarting the master node will unblock it temporarily, but the only remedy is to upgrade.

Thanks @DavidTurner for the suggestion, we will try restarting master nodes and update if that helps.