ILM Indices Blocking Master Queue - All Operations Timeout

Anup_Kumar · April 1, 2026, 7:22am

Our ES cluster has blocked master queue for 12+ days. All cluster state change operations timeout after 30s, including:

Index open operations
Index delete operations
ILM move commands
ILM stop/start commands

3 closed ILM system indices (deprecation logs and ILM history) have stuck ILM retry tasks at the front of the master queue (positions 1-3), blocking all subsequent operations:

.ds-ilm-history-5-2024.02.02-000001 (CLOSED, 787 days old)
└─ ILM history tracking index (system-generated)

.ds-.logs-deprecation.elasticsearch-default-2024.02.02-000001 (CLOSED, 787 days old)└─ Elasticsearch deprecation logs (system-generated)

.ds-.logs-deprecation.elasticsearch-default-2024.02.02-000026 (CLOSED, 787 days old)└─ Elasticsearch deprecation logs (system-generated)

These ILM system indices have policies configured for traditional indices (expecting rollover_alias), but they're data stream indices (.ds- prefix) which don't use aliases. ILM fails with:

setting [index.lifecycle.rollover_alias] for index [...] is empty or not defined

Policies involved:

.deprecation-indexing-ilm-policy (for deprecation log data streams)
ilm-history-ilm-policy (for ILM history data streams)

Both policies have rollover configurations expecting alias-based indices rather than data streams.

Pending Tasks Queue

{
  "insert_order": 78364,
  "source": "ilm-retry-failed-step {policy [.deprecation-indexing-ilm-policy], index [.ds-.logs-deprecation.elasticsearch-default-2024.02.02-000026], failedStep [{\"phase\":\"hot\",\"action\":\"rollover\",\"name\":\"check-rollover-ready\"}]}",
  "time_in_queue": "12.4d"
},
{
  "insert_order": 78365,
  "source": "ilm-retry-failed-step {policy [.deprecation-indexing-ilm-policy], index [.ds-.logs-deprecation.elasticsearch-default-2024.02.02-000001], failedStep [{\"phase\":\"hot\",\"action\":\"rollover\",\"name\":\"check-rollover-ready\"}]}",
  "time_in_queue": "12.4d"
},
{
  "insert_order": 78366,
  "source": "ilm-retry-failed-step {policy [ilm-history-ilm-policy], index [.ds-ilm-history-5-2024.02.02-000001], failedStep [{\"phase\":\"hot\",\"action\":\"rollover\",\"name\":\"check-rollover-ready\"}]}",
  "time_in_queue": "12.4d"
},

Additional affected ILM system indices (OPEN, but blocked by the closed ones):

.ds-ilm-history-5-2025.10.24-000066 (OPEN, empty - 0 docs)
.kibana-event-log-7.17.5-000012 (OPEN, empty - 0 docs)
.ds-.logs-deprecation.elasticsearch-default-2026.03.08-000076 (OPEN, 1 doc)

What We've Tried

1. ILM Move API on Closed Indices (Failed)
POST /_ilm/move/.ds-.logs-deprecation.elasticsearch-default-2024.02.02-000026

{
  "current_step": {"phase": "hot", "action": "rollover", "name": "check-rollover-ready"},
  "next_step": {"phase": "hot", "action": "complete", "name": "complete"}
}

Result: index_closed_exception - Can't read ILM state from closed index

2. ILM Move API on Open Indices (Failed)

3. Open Closed Indices (Failed)


{
 "error": {
    "type": "process_cluster_event_timeout_exception",
    "reason": "failed to process cluster event (open-indices [...]) within 30s"
  },
  "status": 503
}

4. Delete Closed Indices (Failed)


{
  "error": {
    "type": "process_cluster_event_timeout_exception",
    "reason": "failed to process cluster event (delete-index [...]) within 30s"
  },
  "status": 503
}

Index Settings (one of the problematic ILM system indices):


{
  "index.lifecycle.name": ".deprecation-indexing-ilm-policy",
  "index.verified_before_close": "true",
  "index.hidden": "true",
  "index.number_of_shards": "1",
  "index.number_of_replicas": "1"
}

Note: No index.lifecycle.rollover_alias setting (because it's a data stream index).

ILM Explain (showing 53,779 failed retry attempts):

{

  "index": ".ds-.logs-deprecation.elasticsearch-default-2024.02.02-000026",
  "managed": true,
  "policy": ".deprecation-indexing-ilm-policy",
  "step": "ERROR",
  "failed_step": "check-rollover-ready",
  "is_auto_retryable_error": true,
  "failed_step_retry_count": 53779,
  "step_info": {
    "type": "illegal_argument_exception",
    "reason": "setting [index.lifecycle.rollover_alias] for index [.ds-.logs-deprecation.elasticsearch-default-2024.02.02-000026] is empty or not defined"
  }
}

Questions

How can we break this deadlock when even DELETE operations timeout?
Is master node restart help in this case? If so, how do we prevent the same ILM system index tasks from re-queueing on startup?
Can we bypass the master queue for delete operations on ILM system indices, or is there a force-delete mechanism?
Why were the ILM system indices (.logs-deprecation, ilm-history) configured with policies expecting rollover_alias?
Why do closed system indices block the entire master queue indefinitely? Shouldn't the master skip/fail closed index operations quickly rather than blocking?

Impact

ILM Policy Updates: Cannot update ILM policies for business indices (5min timeout)
Cluster Administration: All cluster state changes blocked
Business Operations: Unaffected (read/write to business indices works fine)
Risk: Queue continues growing; cluster may eventually become unstable

Additional Context

Cluster health is GREEN, all data nodes healthy
Only master queue operations are affected
The 3 closed system indices are 787-day-old logs
Current ES version: 8.14.1
Earlier in 2024 when those indices created we were using 7.17 es version

DavidTurner · April 1, 2026, 8:01am

Anup_Kumar:

{
  "insert_order": 78364,
  "source": "ilm-retry-failed-step {policy [.deprecation-indexing-ilm-policy], index [.ds-.logs-deprecation.elasticsearch-default-2024.02.02-000026], failedStep [{\"phase\":\"hot\",\"action\":\"rollover\",\"name\":\"check-rollover-ready\"}]}",
  "time_in_queue": "12.4d"
},

It looks like you have deleted important fields from this response, making it impossible to answer your question. The expected fields come from here:

github.com/elastic/elasticsearch

server/src/main/java/org/elasticsearch/action/admin/cluster/tasks/PendingClusterTasksResponse.java

47a869af5


      
          builder.field(Fields.INSERT_ORDER, pendingClusterTask.getInsertOrder());
          builder.field(Fields.PRIORITY, pendingClusterTask.getPriority());
          builder.field(Fields.SOURCE, pendingClusterTask.getSource());
          builder.field(Fields.EXECUTING, pendingClusterTask.isExecuting());
          builder.field(Fields.TIME_IN_QUEUE_MILLIS, pendingClusterTask.getTimeInQueueInMillis());
          builder.field(Fields.TIME_IN_QUEUE, pendingClusterTask.getTimeInQueue());

This version is pretty old. Please upgrade to at least the latest 8.x version ASAP. If the problem persists, we can dig deeper, but for now I suspect you’re hitting a bug that has since been fixed.

Anup_Kumar · April 1, 2026, 8:37am

@DavidTurner Thank you for looking into this, I am sharing the complete response of pending_tasks api

 {
      "insert_order" : 78364,
      "priority" : "NORMAL",
      "source" : "ilm-retry-failed-step {policy [.deprecation-indexing-ilm-policy], index [.ds-.logs-deprecation.elasticsearch-default-2024.02.02-000026], failedStep [{\"phase\":\"hot\",\"action\":\"rollover\",\"name\":\"check-rollover-ready\"}]}",
      "executing" : false,
      "time_in_queue_millis" : 1098701277,
      "time_in_queue" : "12.7d"
    },
    {
      "insert_order" : 78365,
      "priority" : "NORMAL",
      "source" : "ilm-retry-failed-step {policy [.deprecation-indexing-ilm-policy], index [.ds-.logs-deprecation.elasticsearch-default-2024.02.02-000001], failedStep [{\"phase\":\"hot\",\"action\":\"rollover\",\"name\":\"check-rollover-ready\"}]}",
      "executing" : false,
      "time_in_queue_millis" : 1098701277,
      "time_in_queue" : "12.7d"
    },
    {
      "insert_order" : 78366,
      "priority" : "NORMAL",
      "source" : "ilm-retry-failed-step {policy [ilm-history-ilm-policy], index [.ds-ilm-history-5-2024.02.02-000001], failedStep [{\"phase\":\"hot\",\"action\":\"rollover\",\"name\":\"check-rollover-ready\"}]}",
      "executing" : false,
      "time_in_queue_millis" : 1098701277,
      "time_in_queue" : "12.7d"
    },
    {
      "insert_order" : 78367,
      "priority" : "NORMAL",
      "source" : "ilm-set-step-info {policy [ilm-history-ilm-policy], index [.ds-ilm-history-5-2025.10.24-000066], currentStep [{\"phase\":\"hot\",\"action\":\"rollover\",\"name\":\"check-rollover-ready\"}]}",
      "executing" : false,
      "time_in_queue_millis" : 1098701277,
      "time_in_queue" : "12.7d"
    },
    {
      "insert_order" : 78368,
      "priority" : "NORMAL",
      "source" : "ilm-set-step-info {policy [kibana-event-log-policy], index [.kibana-event-log-7.17.5-000012], currentStep [{\"phase\":\"hot\",\"action\":\"rollover\",\"name\":\"check-rollover-ready\"}]}",
      "executing" : false,
      "time_in_queue_millis" : 1098701277,
      "time_in_queue" : "12.7d"
    },
    {
      "insert_order" : 78369,
      "priority" : "NORMAL",
      "source" : "ilm-set-step-info {policy [.deprecation-indexing-ilm-policy], index [.ds-.logs-deprecation.elasticsearch-default-2026.03.08-000076], currentStep [{\"phase\":\"hot\",\"action\":\"rollover\",\"name\":\"check-rollover-ready\"}]}",
      "executing" : false,
      "time_in_queue_millis" : 1098701277,
      "time_in_queue" : "12.7d"
    },
    {
      "insert_order" : 93402,
      "priority" : "NORMAL",
      "source" : "rollover_index source [.ds-ilm-history-7-2026.03.24-000037] to target [.ds-ilm-history-7-2026.03.24-000037]",
      "executing" : false,
      "time_in_queue_millis" : 578212374,
      "time_in_queue" : "6.6d"
    }

2 more indexes got added in the queue recently

 {
      "insert_order" : 87633,
      "priority" : "URGENT",
      "source" : "delete-index [[.ds-ilm-history-7-2025.12.16-000012/u_FwAkCIT6O24jHVzuV3-g]]",
      "executing" : false,
      "time_in_queue_millis" : 667310608,
      "time_in_queue" : "7.7d"
    },
    {
      "insert_order" : 173265,
      "priority" : "URGENT",
      "source" : "delete-index [[.ds-ilm-history-7-2025.12.23-000013/3qIqfiKtQ8Oc1X7XVk4HBQ]]",
      "executing" : false,
      "time_in_queue_millis" : 62223523,
      "time_in_queue" : "17.2h"
    },

If you need any other details, please let me know, I can share.

We can’t upgrade the version right away, is there any work around for this problem?

Also, trigger ilm/explain api on one of the indices :

Full ILM Explain Output (showing 53,779 failed retry attempts):

{
  "indices": {
    ".ds-.logs-deprecation.elasticsearch-default-2024.02.02-000026": {
      "index": ".ds-.logs-deprecation.elasticsearch-default-2024.02.02-000026",
      "managed": true,
      "policy": ".deprecation-indexing-ilm-policy",
      "index_creation_date_millis": 1706875024155,
      "time_since_index_creation": "787.87d",
      "lifecycle_date_millis": 1706875024155,
      "age": "787.87d",
      "phase": "hot",
      "phase_time_millis": 1773871448834,
      "action": "rollover",
      "action_time_millis": 1706875025202,
      "step": "ERROR",
      "step_time_millis": 1773872048904,
      "failed_step": "check-rollover-ready",
      "is_auto_retryable_error": true,
      "failed_step_retry_count": 53779,
      "step_info": {
        "type": "illegal_argument_exception",
        "reason": "setting [index.lifecycle.rollover_alias] for index [.ds-.logs-deprecation.elasticsearch-default-2024.02.02-000026] is empty or not defined"
      },
      "phase_execution": {
        "policy": ".deprecation-indexing-ilm-policy",
        "phase_definition": {
          "min_age": "0ms",
          "actions": {
            "rollover": {
              "max_age": "30d",
              "min_docs": 1,
              "max_primary_shard_docs": 200000000,
              "max_primary_shard_size": "10gb"
            }
          }
        },
        "version": 1,
        "modified_date_in_millis": 1673159414744
      }
    }
  }
}

DavidTurner · April 1, 2026, 9:30am

Looks like a bug we’ve fixed since 8.14 indeed. Restarting the master node will unblock it temporarily but the only remedy is to upgrade.

Anup_Kumar · April 1, 2026, 10:01am

Thanks @DavidTurner for the suggestion, we will try restarting master nodes and update if that helps.

Anup_Kumar · April 2, 2026, 8:40am

@DavidTurner Restarting the master node resolved the issue, thank you for the suggestion. We plan to upgrade to a newer version going forward, but this served as a good temporary workaround.

Topic		Replies	Views
ILM action failed "check-rollover-ready,Moving to ERROR step"? Elasticsearch ilm-index-lifecycle-management	4	3823	March 10, 2022
Index lifecycle not working because of "FORBIDDEN/12/index read-only / allow delete" Elasticsearch ilm-index-lifecycle-management	4	1529	March 18, 2020
Index rollover and ILM issue Elasticsearch ilm-index-lifecycle-management	12	387	March 4, 2025
ILM not deleting index Elasticsearch ilm-index-lifecycle-management	22	2512	July 5, 2023
ILM policy not applied Elasticsearch ilm-index-lifecycle-management	5	944	June 6, 2024

ILM Indices Blocking Master Queue - All Operations Timeout

Impact

Additional Context

Related topics