This was also opened as a GitHub issue and I replied there:
ECK Elasticsearch Cluster Diagnosis Summary
Date: 2025-10-16
Cluster: quickstart (default namespace)
Version: Elasticsearch 9.1.5, ECK Operator 2.16.1
Investigation: Suspected ECK operator bug
Summary
Conclusion: Not an ECK operator bug. The operator is functioning correctly and is appropriately refusing to proceed with the rolling restart because it cannot verify cluster health.
Elasticsearch node quickstart-es-default-3 is not ready due to DNS resolution failures in the Kubernetes cluster infrastructure. The Elasticsearch cluster is healthy and operational. The ECK operator correctly initiated a rolling restart but cannot complete the operation because DNS queries to Elasticsearch pod FQDNs consistently time out, preventing the operator from verifying cluster safety before proceeding.
Impact: 1 of 4 nodes is marked not ready; the rolling restart has been stalled for 11+ hours.
Root Cause: Kubernetes DNS infrastructure issue - not an ECK operator defect.
Investigation Findings
1. Elasticsearch Cluster Status (Healthy)
Cluster Health:
{
  "status": "green",
  "number_of_nodes": 4,
  "active_shards": 176,
  "unassigned_shards": 0,
  "active_shards_percent_as_number": 100.0
}
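For reference, this health output can be pulled straight from the cluster. Below is a minimal sketch in Go, assuming kubectl port-forward service/quickstart-es-http 9200 is running locally and the elastic user's password is exported as ES_PASSWORD (both are assumptions about the environment, not part of the diagnostics bundle):

package main

import (
	"crypto/tls"
	"fmt"
	"io"
	"net/http"
	"os"
)

func main() {
	// ECK serves HTTPS with a self-signed CA by default; skipping verification is
	// acceptable only for an ad-hoc check like this one.
	client := &http.Client{Transport: &http.Transport{TLSClientConfig: &tls.Config{InsecureSkipVerify: true}}}
	req, err := http.NewRequest("GET", "https://localhost:9200/_cluster/health?pretty", nil)
	if err != nil {
		panic(err)
	}
	req.SetBasicAuth("elastic", os.Getenv("ES_PASSWORD")) // illustrative env var holding the elastic password
	resp, err := client.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body)) // expect "status": "green" and "unassigned_shards": 0
}

Any status other than green, or a non-zero unassigned_shards count, would point back at Elasticsearch itself; here it does not.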
Node-3 Status:
- Container running normally since 2025-10-16T02:36:43Z
- Successfully joined cluster at 02:37:12Z
- No errors in Elasticsearch logs
- All cluster operations functioning correctly
Finding: Elasticsearch is healthy and not experiencing any internal issues.
2. ECK Operator Behavior (Working as Designed)
What the operator did correctly:
- Initiated rolling restart procedure
- Called Elasticsearch Shutdown API to prepare node-3
- Node-3 successfully completed shutdown preparation (migrated shards)
- Operator attempted to verify cluster health before proceeding
- DNS queries failed - operator correctly refused to proceed
Elasticsearch Resource Status:
status:
  availableNodes: 3
  phase: "ApplyingChanges"
  health: "unknown"
  conditions:
  - type: ElasticsearchIsReachable
    status: "False"
    message: "Service has endpoints but Elasticsearch is unavailable"
Finding: Operator is behaving correctly by not proceeding with potentially unsafe operations when it cannot verify cluster state.
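The status block above can also be read programmatically from the Elasticsearch custom resource (group elasticsearch.k8s.elastic.co, version v1). A sketch using client-go's dynamic client; the namespace and resource name come from this quickstart deployment, while the kubeconfig location is an assumption:

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes the default kubeconfig at ~/.kube/config.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	// GVR of the ECK Elasticsearch custom resource.
	gvr := schema.GroupVersionResource{Group: "elasticsearch.k8s.elastic.co", Version: "v1", Resource: "elasticsearches"}
	es, err := dyn.Resource(gvr).Namespace("default").Get(context.Background(), "quickstart", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	phase, _, _ := unstructured.NestedString(es.Object, "status", "phase")
	health, _, _ := unstructured.NestedString(es.Object, "status", "health")
	fmt.Printf("phase=%s health=%s\n", phase, health)
	conditions, _, _ := unstructured.NestedSlice(es.Object, "status", "conditions")
	for _, c := range conditions {
		cond, ok := c.(map[string]interface{})
		if !ok {
			continue
		}
		// In this incident, ElasticsearchIsReachable=False with
		// "Service has endpoints but Elasticsearch is unavailable".
		fmt.Printf("%v=%v: %v\n", cond["type"], cond["status"], cond["message"])
	}
}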
3. Node Shutdown State (Expected Behavior)
Shutdown API Status:
{
  "node_id": "PrRzhu4ASKad-wlj7tmX5Q",
  "type": "RESTART",
  "reason": "pre-stop hook",
  "status": "COMPLETE",
  "shard_migration": {"status": "COMPLETE"}
}
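This record comes from the Elasticsearch node shutdown API (GET _nodes/shutdown). A sketch querying it under the same port-forward and ES_PASSWORD assumptions as the health check above:

package main

import (
	"crypto/tls"
	"fmt"
	"io"
	"net/http"
	"os"
)

func main() {
	client := &http.Client{Transport: &http.Transport{TLSClientConfig: &tls.Config{InsecureSkipVerify: true}}}
	// GET _nodes/shutdown lists every registered node shutdown and its shard-migration state.
	req, err := http.NewRequest("GET", "https://localhost:9200/_nodes/shutdown", nil)
	if err != nil {
		panic(err)
	}
	req.SetBasicAuth("elastic", os.Getenv("ES_PASSWORD"))
	resp, err := client.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body)) // the RESTART entry for node-3 should report status COMPLETE, as above
}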
Timeline:
- Operator marked node for restart (normal orchestration)
- Node prepared for shutdown - shards migrated successfully
- Operator attempted to verify cluster health before terminating pod
- DNS resolution failed - operator waiting for connectivity
- Readiness probe fails (expected - node in shutdown mode)
Finding: The shutdown preparation completed successfully. The operator is correctly waiting to verify cluster safety before proceeding with pod termination.
4. Root Cause: DNS Infrastructure Failure (External Issue)
Persistent DNS timeout errors from Oct 14-16:
Error: "dial tcp: lookup quickstart-es-default-[0-3].quickstart-es-default.default
on 172.20.0.10:53: read udp 10.200.0.253:XXXXX->172.20.0.10:53: i/o timeout"
DNS Configuration:
- DNS Server: 172.20.0.10:53 (CoreDNS)
- ECK Operator IP: 10.200.0.253
- Target FQDNs: quickstart-es-default-{0,1,2,3}.quickstart-es-default.default
What's happening:
- Operator makes HTTPS request to Elasticsearch API
- Kubernetes attempts DNS lookup of pod FQDN
- DNS query times out (infrastructure failure)
- Operator cannot reach Elasticsearch API
- Operator cannot verify cluster is safe
- Operator correctly refuses to proceed
Finding: The DNS infrastructure (CoreDNS) is not responding to queries from the operator pod. This is a Kubernetes platform issue, not an ECK operator issue.
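The failure mode is straightforward to reproduce from a debug pod in the same cluster (or from the operator pod, if it has a shell): resolve one of the FQDNs with a short timeout and check whether CoreDNS answers. A minimal Go sketch; the 2-second timeout is illustrative and the FQDN is the fully qualified form of the name in the operator error:

package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

func main() {
	fqdn := "quickstart-es-default-2.quickstart-es-default.default.svc.cluster.local"
	// PreferGo makes Go's resolver query the nameserver from /etc/resolv.conf (CoreDNS) directly.
	r := &net.Resolver{PreferGo: true}
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	addrs, err := r.LookupHost(ctx, fqdn)
	if err != nil {
		// An "i/o timeout" here mirrors exactly the error the operator is logging.
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Println("resolved:", addrs)
}

On a healthy DNS path this prints the pod IP almost immediately; during this incident it would time out, matching the operator logs.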
Root Cause Analysis
The operator is not at fault. The ECK operator is working as designed:
- Correctly initiated rolling restart
- Properly used Elasticsearch Shutdown API
- Validated shutdown preparation completed
- Attempted to verify cluster health before proceeding
- Detected inability to reach Elasticsearch
- Appropriately refused to proceed with potentially unsafe operations
- Logged clear error messages indicating DNS resolution failure
The infrastructure is at fault:
Evidence of DNS Infrastructure Failure:
- DNS timeout errors span multiple days (Oct 14-16)
- Affects all Elasticsearch node FQDNs intermittently
- Consistent pattern: queries to CoreDNS (172.20.0.10:53) time out
- Elasticsearch cluster remains healthy throughout (proving it's not an ES issue)
- No ECK operator code defects or logic errors observed
Likely DNS Infrastructure Causes:
- CoreDNS pods overloaded, unhealthy, or under-resourced (see the sketch after this list)
- Network policy blocking DNS queries from operator namespace
- CNI plugin issues causing packet loss between pods
- DNS service disruption or rate limiting
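The first hypothesis is the easiest to check: inspect the CoreDNS pods themselves. A sketch with client-go, assuming the conventional kube-system namespace and k8s-app=kube-dns label (both can differ by distribution):

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	// CoreDNS pods conventionally carry the k8s-app=kube-dns label.
	pods, err := cs.CoreV1().Pods("kube-system").List(context.Background(),
		metav1.ListOptions{LabelSelector: "k8s-app=kube-dns"})
	if err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		restarts := int32(0)
		for _, s := range p.Status.ContainerStatuses {
			restarts += s.RestartCount
		}
		// Frequent restarts or a non-Running phase would support the "CoreDNS unhealthy" hypothesis.
		fmt.Printf("%s phase=%s restarts=%d\n", p.Name, p.Status.Phase, restarts)
	}
}

The remaining hypotheses (network policy, CNI packet loss, rate limiting) need packet captures or CoreDNS metrics and are outside the scope of this bundle.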
Conclusion
This is definitively not an ECK operator bug. The operator is functioning correctly and exhibiting appropriate defensive behavior.
The ECK operator properly:
- Uses Elasticsearch APIs according to best practices
- Implements safe orchestration by verifying cluster health before destructive operations
- Logs clear diagnostic information about connectivity failures
- Refuses to proceed when it cannot guarantee safety
The actual problem is a Kubernetes DNS infrastructure failure where queries from the ECK operator pod to CoreDNS consistently time out. The operator has no control over DNS resolution - it correctly relies on Kubernetes platform services, which are currently failing.
Operator behavior is correct: It is safer for the operator to wait indefinitely than to proceed with pod termination when cluster health cannot be verified. This prevents potential data loss or cluster instability.
Fix Required: Resolve the DNS resolution failures in the Kubernetes cluster infrastructure. This is a platform/infrastructure issue, not an application-layer (ECK) issue.
Appendix: Error Samples
DNS Timeout from Operator (Oct 16, 02:42:52):
{
  "log.level": "error",
  "@timestamp": "2025-10-16T02:42:52.860Z",
  "error": "dial tcp: lookup quickstart-es-default-2.quickstart-es-default.default on 172.20.0.10:53: read udp 10.200.0.253:44991->172.20.0.10:53: i/o timeout"
}
Operator Correctly Detecting Unreachability:
{
  "log.level": "info",
  "@timestamp": "2025-10-16T02:43:31.933Z",
  "message": "Elasticsearch cannot be reached yet, re-queuing",
  "namespace": "default",
  "es_name": "quickstart"
}
These logs demonstrate the operator is working correctly: detecting the connectivity issue, logging it clearly, and re-queuing the reconciliation rather than proceeding unsafely.
AI-generated Diagnosis Based On: ECK Diagnostics Bundle (2025-10-16T13:49:06)
================
Diagnosis seems plausible to me. I don't think we have a bug here.