Our ES cluster's master task queue has been blocked for 12+ days. All cluster state change operations time out after 30s, including:
- Index open operations
- Index delete operations
- ILM move commands
- ILM stop/start commands
3 closed ILM system indices (deprecation logs and ILM history) have stuck ILM retry tasks at the front of the master queue (positions 1-3), blocking all subsequent operations:
.ds-ilm-history-5-2024.02.02-000001 (CLOSED, 787 days old)
└─ ILM history tracking index (system-generated)
.ds-.logs-deprecation.elasticsearch-default-2024.02.02-000001 (CLOSED, 787 days old)
└─ Elasticsearch deprecation logs (system-generated)
.ds-.logs-deprecation.elasticsearch-default-2024.02.02-000026 (CLOSED, 787 days old)
└─ Elasticsearch deprecation logs (system-generated)
These ILM system indices have policies configured for traditional indices (expecting rollover_alias), but they are data stream backing indices (.ds- prefix), which don't use aliases. ILM fails with:
setting [index.lifecycle.rollover_alias] for index [...] is empty or not defined
Policies involved:
- .deprecation-indexing-ilm-policy (for deprecation log data streams)
- ilm-history-ilm-policy (for ILM history data streams)
Both policies have rollover configurations expecting alias-based indices rather than data streams.
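For context, the hot phase of these policies is essentially just a rollover action. A minimal sketch of the shape (the thresholds here are illustrative, not the values shipped with Elasticsearch):

```
PUT _ilm/policy/.deprecation-indexing-ilm-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_primary_shard_size": "50gb",
            "max_age": "30d"
          }
        }
      }
    }
  }
}
```

Our understanding is that the check-rollover-ready step only falls back to requiring index.lifecycle.rollover_alias when it cannot resolve the index to a data stream, which seems to be what happens with these orphaned closed backing indices.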
Pending Tasks Queue
{
  "insert_order": 78364,
  "source": "ilm-retry-failed-step {policy [.deprecation-indexing-ilm-policy], index [.ds-.logs-deprecation.elasticsearch-default-2024.02.02-000026], failedStep [{\"phase\":\"hot\",\"action\":\"rollover\",\"name\":\"check-rollover-ready\"}]}",
  "time_in_queue": "12.4d"
},
{
  "insert_order": 78365,
  "source": "ilm-retry-failed-step {policy [.deprecation-indexing-ilm-policy], index [.ds-.logs-deprecation.elasticsearch-default-2024.02.02-000001], failedStep [{\"phase\":\"hot\",\"action\":\"rollover\",\"name\":\"check-rollover-ready\"}]}",
  "time_in_queue": "12.4d"
},
{
  "insert_order": 78366,
  "source": "ilm-retry-failed-step {policy [ilm-history-ilm-policy], index [.ds-ilm-history-5-2024.02.02-000001], failedStep [{\"phase\":\"hot\",\"action\":\"rollover\",\"name\":\"check-rollover-ready\"}]}",
  "time_in_queue": "12.4d"
},
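(The queue entries above were captured with the cluster pending tasks API:)

```
GET _cluster/pending_tasks
```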
Additional affected ILM system indices (OPEN, but blocked by the closed ones):
.ds-ilm-history-5-2025.10.24-000066 (OPEN, empty - 0 docs)
.kibana-event-log-7.17.5-000012 (OPEN, empty - 0 docs)
.ds-.logs-deprecation.elasticsearch-default-2026.03.08-000076 (OPEN, 1 doc)
What We've Tried
1. ILM Move API on Closed Indices (Failed)
POST /_ilm/move/.ds-.logs-deprecation.elasticsearch-default-2024.02.02-000026
{
  "current_step": {"phase": "hot", "action": "rollover", "name": "check-rollover-ready"},
  "next_step": {"phase": "hot", "action": "complete", "name": "complete"}
}
Result: index_closed_exception - Can't read ILM state from closed index
2. ILM Move API on Open Indices (Failed)
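(The request here mirrored step 1, just targeting one of the open indices; reconstructed for completeness, with the index name and steps as examples:)

```
POST /_ilm/move/.ds-ilm-history-5-2025.10.24-000066
{
  "current_step": {"phase": "hot", "action": "rollover", "name": "check-rollover-ready"},
  "next_step": {"phase": "hot", "action": "complete", "name": "complete"}
}
```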
3. Open Closed Indices (Failed)
{
  "error": {
    "type": "process_cluster_event_timeout_exception",
    "reason": "failed to process cluster event (open-indices [...]) within 30s"
  },
  "status": 503
}
4. Delete Closed Indices (Failed)
{
  "error": {
    "type": "process_cluster_event_timeout_exception",
    "reason": "failed to process cluster event (delete-index [...]) within 30s"
  },
  "status": 503
}
Index Settings (one of the problematic ILM system indices):
{
  "index.lifecycle.name": ".deprecation-indexing-ilm-policy",
  "index.verified_before_close": "true",
  "index.hidden": "true",
  "index.number_of_shards": "1",
  "index.number_of_replicas": "1"
}
Note: No index.lifecycle.rollover_alias setting (because it's a data stream index).
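(For reference, the settings above were retrieved with the get-settings API; flat_settings makes the missing key easy to spot:)

```
GET /.ds-.logs-deprecation.elasticsearch-default-2024.02.02-000026/_settings?flat_settings=true
```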
ILM Explain (showing 53,779 failed retry attempts):
{
  "index": ".ds-.logs-deprecation.elasticsearch-default-2024.02.02-000026",
  "managed": true,
  "policy": ".deprecation-indexing-ilm-policy",
  "step": "ERROR",
  "failed_step": "check-rollover-ready",
  "is_auto_retryable_error": true,
  "failed_step_retry_count": 53779,
  "step_info": {
    "type": "illegal_argument_exception",
    "reason": "setting [index.lifecycle.rollover_alias] for index [.ds-.logs-deprecation.elasticsearch-default-2024.02.02-000026] is empty or not defined"
  }
}
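(The output above comes from the ILM explain API:)

```
GET /.ds-.logs-deprecation.elasticsearch-default-2024.02.02-000026/_ilm/explain
```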
Questions
- How can we break this deadlock when even DELETE operations time out?
- Would a master node restart help in this case? If so, how do we prevent the same ILM system index tasks from re-queueing on startup?
- Can we bypass the master queue for delete operations on ILM system indices, or is there a force-delete mechanism?
- Why were the ILM system indices (.logs-deprecation, ilm-history) configured with policies expecting rollover_alias?
- Why do closed system indices block the entire master queue indefinitely? Shouldn't the master fail operations on closed indices quickly rather than blocking?
Impact
- ILM Policy Updates: Cannot update ILM policies for business indices (times out even with a 5min timeout)
- Cluster Administration: All cluster state changes blocked
- Business Operations: Unaffected (read/write to business indices works fine)
- Risk: Queue continues growing; cluster may eventually become unstable
Additional Context
- Cluster health is GREEN, all data nodes healthy
- Only master queue operations are affected
- The 3 closed system indices are 787-day-old logs
- Current ES version: 8.14.1
- Those indices were created in early 2024, when the cluster was running Elasticsearch 7.17