Pod fails to become ready after restart

The ECK operator v2.16.1 is installed in the cluster, which runs Kubernetes 1.32 on Amazon EKS.

The following manifest brings up a four-node cluster with no errors:

apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: quickstart
spec:
  version: 9.1.5
  nodeSets:
  - name: default
    count: 4
    config:
      node.store.allow_mmap: false
      xpack.monitoring.collection.enabled: true
    podTemplate:
      spec:
        containers:
        - name: elasticsearch
          resources:
            requests:
              memory: "8Gi"
              cpu: "500m"
            limits:
              memory: "16Gi"
              cpu: "2000m"
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 50Gi
  podDisruptionBudget:
    spec:
      maxUnavailable: 1

All pods come up, the cluster is created, and all pods enter a ready state.

If a pod is rescheduled onto a different node (due to a deletion, an upgrade, whatever), the new pod comes up and joins the cluster. The cluster state goes green, indices are copied, and everything seems fine, but the pod never becomes ready: the readiness TCP port 8080 on localhost is never opened, so the readiness check never passes.
There are no errors in the log, but also no INFO-level log message from the ReadinessService.
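
One way to confirm the port is closed from inside the non-ready pod (a sketch, assuming bash is available in the elasticsearch container and that 8080 is the readiness port mentioned above):

# probe the readiness port from inside the pod; bash's /dev/tcp avoids needing extra tools
kubectl exec quickstart-es-default-3 -c elasticsearch -- \
  bash -c 'if echo > /dev/tcp/127.0.0.1/8080; then echo "readiness port open"; else echo "readiness port closed"; fi'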

Attached is the eck-diagnostics output. At the time it was run, quickstart-es-default-3 was in a non-ready state from a Kubernetes perspective, but the cluster was healthy with four nodes.

quickstart-es-default-0          1/1     Running   0             44m
quickstart-es-default-1          1/1     Running   0             44m
quickstart-es-default-2          1/1     Running   0             44m
quickstart-es-default-3          0/1     Running   0             40m
quickstart-kb-66d4b9b97b-tzrbt   1/1     Running   0             44m
{
  "cluster_name": "quickstart",
  "status": "green",
  "timed_out": false,
  "number_of_nodes": 4,
  "number_of_data_nodes": 4,
  "active_primary_shards": 66,
  "active_shards": 178,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 0,
  "unassigned_primary_shards": 0,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 100
}

Any help debugging would be appreciated.

This was also opened as a GitHub issue, and I replied there:

ECK Elasticsearch Cluster Diagnosis Summary

Date: 2025-10-16
Cluster: quickstart (default namespace)
Version: Elasticsearch 9.1.5, ECK Operator 2.16.1
Investigation: Suspected ECK operator bug


Summary

Conclusion: Not an ECK operator bug. The operator is functioning correctly and is appropriately refusing to proceed with a rolling restart because it cannot verify cluster health.

Elasticsearch node quickstart-es-default-3 is not ready due to DNS resolution failures in the Kubernetes cluster infrastructure. The Elasticsearch cluster is healthy and operational. The ECK operator correctly initiated a rolling restart but cannot complete the operation because DNS queries to Elasticsearch pod FQDNs consistently time out, preventing the operator from verifying cluster safety before proceeding.

Impact: 1 of 4 nodes marked not ready, rolling restart deadlocked for 11+ hours.

Root Cause: Kubernetes DNS infrastructure issue - not an ECK operator defect.


Investigation Findings

1. Elasticsearch Cluster Status (Healthy)

Cluster Health:

{
  "status": "green",
  "number_of_nodes": 4,
  "active_shards": 176,
  "unassigned_shards": 0,
  "active_shards_percent_as_number": 100.0
}

Node-3 Status:

  • Container running normally since 2025-10-16T02:36:43Z
  • Successfully joined cluster at 02:37:12Z
  • No errors in Elasticsearch logs
  • All cluster operations functioning correctly

Finding: Elasticsearch is healthy and not experiencing any internal issues.

2. ECK Operator Behavior (Working as Designed)

What the operator did correctly:

  1. Initiated rolling restart procedure
  2. Called Elasticsearch Shutdown API to prepare node-3
  3. Node-3 successfully completed shutdown preparation (migrated shards)
  4. Operator attempted to verify cluster health before proceeding
  5. DNS queries failed - operator correctly refused to proceed
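
These steps show up in the operator logs; a sketch for pulling the relevant entries, assuming the default install (StatefulSet elastic-operator in the elastic-system namespace):

# filter the operator log for connectivity and re-queue messages like the ones in the appendix
kubectl -n elastic-system logs statefulset/elastic-operator --tail=500 | grep -Ei 'i/o timeout|cannot be reached|re-queuing'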

Elasticsearch Resource Status:

status:
  availableNodes: 3
  phase: "ApplyingChanges"
  health: "unknown"
  conditions:
    - type: ElasticsearchIsReachable
      status: "False"
      message: "Service has endpoints but Elasticsearch is unavailable"

Finding: Operator is behaving correctly by not proceeding with potentially unsafe operations when it cannot verify cluster state.
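
The status block above can be re-checked at any time from the Elasticsearch custom resource, for example:

# printer columns show HEALTH / NODES / VERSION / PHASE at a glance
kubectl get elasticsearch quickstart
# the full status block, including the ElasticsearchIsReachable condition, is at the end of the resource
kubectl get elasticsearch quickstart -o yaml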

3. Node Shutdown State (Expected Behavior)

Shutdown API Status:

{
  "node_id": "PrRzhu4ASKad-wlj7tmX5Q",
  "type": "RESTART",
  "reason": "pre-stop hook",
  "status": "COMPLETE",
  "shard_migration": {"status": "COMPLETE"}
}
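
The same records can be read back through the Elasticsearch node shutdown API; a sketch using the elastic user secret that ECK creates for the quickstart cluster:

PW=$(kubectl get secret quickstart-es-elastic-user -o go-template='{{.data.elastic | base64decode}}')
# GET _nodes/shutdown lists any in-progress or completed shutdown registrations
kubectl exec quickstart-es-default-0 -c elasticsearch -- \
  curl -sk -u "elastic:$PW" "https://localhost:9200/_nodes/shutdown?pretty"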

Timeline:

  1. Operator marked node for restart (normal orchestration)
  2. Node prepared for shutdown - shards migrated successfully
  3. Operator attempted to verify cluster health before terminating pod
  4. DNS resolution failed - operator waiting for connectivity
  5. Readiness probe fails (expected - node in shutdown mode)

Finding: The shutdown preparation completed successfully. The operator is correctly waiting to verify cluster safety before proceeding with pod termination.

4. Root Cause: DNS Infrastructure Failure (External Issue)

Persistent DNS timeout errors from Oct 14-16:

Error: "dial tcp: lookup quickstart-es-default-[0-3].quickstart-es-default.default 
       on 172.20.0.10:53: read udp 10.200.0.253:XXXXX->172.20.0.10:53: i/o timeout"

DNS Configuration:

  • DNS Server: 172.20.0.10:53 (CoreDNS)
  • ECK Operator IP: 10.200.0.253
  • Target FQDNs: quickstart-es-default-{0,1,2,3}.quickstart-es-default.default

What's happening:

  1. Operator makes HTTPS request to Elasticsearch API
  2. Kubernetes attempts DNS lookup of pod FQDN
  3. DNS query times out (infrastructure failure)
  4. Operator cannot reach Elasticsearch API
  5. Operator cannot verify cluster is safe
  6. Operator correctly refuses to proceed

Finding: The DNS infrastructure (CoreDNS) is not responding to queries from the operator pod. This is a Kubernetes platform issue, not an ECK operator issue.
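
The failing lookup can be reproduced from a throwaway pod (the dns-test name and busybox image are just examples; the target FQDN and DNS server IP come from the errors above, assuming the default cluster.local cluster domain):

# query CoreDNS directly for one of the pod FQDNs behind the headless service
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 -- \
  nslookup quickstart-es-default-3.quickstart-es-default.default.svc.cluster.local 172.20.0.10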


Root Cause Analysis

The operator is not at fault. The ECK operator is working as designed:

  1. Correctly initiated rolling restart
  2. Properly used Elasticsearch Shutdown API
  3. Validated shutdown preparation completed
  4. Attempted to verify cluster health before proceeding
  5. Detected inability to reach Elasticsearch
  6. Appropriately refused to proceed with potentially unsafe operations
  7. Logged clear error messages indicating DNS resolution failure

The infrastructure is at fault:

Evidence of DNS Infrastructure Failure:

  • DNS timeout errors span multiple days (Oct 14-16)
  • Affects all Elasticsearch node FQDNs intermittently
  • Consistent pattern: queries to CoreDNS (172.20.0.10:53) timeout
  • Elasticsearch cluster remains healthy throughout (proving it's not an ES issue)
  • No ECK operator code defects or logic errors observed

Likely DNS Infrastructure Causes:

  • CoreDNS pods overloaded, unhealthy, or under-resourced
  • Network policy blocking DNS queries from operator namespace
  • CNI plugin issues causing packet loss between pods
  • DNS service disruption or rate limiting
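
A starting checklist for ruling these out (pod labels and the coredns deployment name follow the standard EKS add-on; adjust if the cluster uses something else):

# CoreDNS pod health, placement, and recent logs
kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide
kubectl -n kube-system logs -l k8s-app=kube-dns --tail=100
kubectl -n kube-system get deployment coredns
# any NetworkPolicies that could block port 53 traffic from the operator namespace
kubectl get networkpolicy -A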

Conclusion

This is definitively not an ECK operator bug. The operator is functioning correctly and exhibiting appropriate defensive behavior.

The ECK operator properly:

  • Uses Elasticsearch APIs according to best practices
  • Implements safe orchestration by verifying cluster health before destructive operations
  • Logs clear diagnostic information about connectivity failures
  • Refuses to proceed when it cannot guarantee safety

The actual problem is a Kubernetes DNS infrastructure failure where queries from the ECK operator pod to CoreDNS consistently time out. The operator has no control over DNS resolution - it correctly relies on Kubernetes platform services, which are currently failing.

Operator behavior is correct: It is safer for the operator to wait indefinitely than to proceed with pod termination when cluster health cannot be verified. This prevents potential data loss or cluster instability.

Fix Required: Resolve the DNS resolution failures in the Kubernetes cluster infrastructure. This is a platform/infrastructure issue, not an application-layer (ECK) issue.


Appendix: Error Samples

DNS Timeout from Operator (Oct 16, 02:42:52):

{
  "log.level": "error",
  "@timestamp": "2025-10-16T02:42:52.860Z",
  "error": "dial tcp: lookup quickstart-es-default-2.quickstart-es-default.default on 172.20.0.10:53: read udp 10.200.0.253:44991->172.20.0.10:53: i/o timeout"
}

Operator Correctly Detecting Unreachability:

{
  "log.level": "info",
  "@timestamp": "2025-10-16T02:43:31.933Z",
  "message": "Elasticsearch cannot be reached yet, re-queuing",
  "namespace": "default",
  "es_name": "quickstart"
}

These logs demonstrate the operator is working correctly: detecting the connectivity issue, logging it clearly, and re-queuing the reconciliation rather than proceeding unsafely.


AI-generated Diagnosis Based On: ECK Diagnostics Bundle (2025-10-16T13:49:06)

================
Diagnosis seems plausible to me. I don't think we have a bug here.
