Pod fails to become ready after restart

The ECK operator v2.16.1 is installed in the cluster, which runs Kubernetes 1.32 on Amazon EKS.

The following manifest brings up a four-node cluster with no errors:

apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: quickstart
spec:
  version: 9.1.5
  nodeSets:
  - name: default
    count: 4
    config:
      node.store.allow_mmap: false
      xpack.monitoring.collection.enabled: true
    podTemplate:
      spec:
        containers:
        - name: elasticsearch
          resources:
            requests:
              memory: "8Gi"
              cpu: "500m"
            limits:
              memory: "16Gi"
              cpu: "2000m"
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 50Gi
  podDisruptionBudget:
    spec:
      maxUnavailable: 1

All pods come up, the cluster is created, and all pods enter a ready state.

If a pod is rescheduled onto a different node (due to a deletion, an upgrade, whatever), the new pod comes up and joins the cluster. The cluster state goes green, indices are copied, and everything seems fine, but the pod never becomes ready: the readiness TCP port 8080 on localhost is never opened, so the readiness check never passes.
There are no errors in the log, but also no INFO-level log message from the ReadinessService.
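
One way to confirm the port is closed from inside the non-ready pod (a sketch, assuming bash is available in the elasticsearch container and that 8080 is the readiness port mentioned above):

# probe the readiness port from inside the pod; bash's /dev/tcp avoids needing extra tools
kubectl exec quickstart-es-default-3 -c elasticsearch -- \
  bash -c 'if echo > /dev/tcp/127.0.0.1/8080; then echo "readiness port open"; else echo "readiness port closed"; fi'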

Attached is the eck-diagnostics output. At the time it was run, quickstart-es-default-3 was in a non-ready state from a Kubernetes perspective, but the cluster was healthy with four nodes.

quickstart-es-default-0          1/1     Running   0             44m
quickstart-es-default-1          1/1     Running   0             44m
quickstart-es-default-2          1/1     Running   0             44m
quickstart-es-default-3          0/1     Running   0             40m
quickstart-kb-66d4b9b97b-tzrbt   1/1     Running   0             44m
{
  "cluster_name": "quickstart",
  "status": "green",
  "timed_out": false,
  "number_of_nodes": 4,
  "number_of_data_nodes": 4,
  "active_primary_shards": 66,
  "active_shards": 178,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 0,
  "unassigned_primary_shards": 0,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 100
}

Any help debugging would be appreciated.

This was also opened as a GitHub issue, and I replied there:

ECK Elasticsearch Cluster Diagnosis Summary

Date: 2025-10-16
Cluster: quickstart (default namespace)
Version: Elasticsearch 9.1.5, ECK Operator 2.16.1
Investigation: Suspected ECK operator bug


Summary

Conclusion: Not an ECK operator bug. The operator is functioning correctly and is appropriately refusing to proceed with a rolling restart because it cannot verify cluster health.

Elasticsearch node quickstart-es-default-3 is not ready due to DNS resolution failures in the Kubernetes cluster infrastructure. The Elasticsearch cluster is healthy and operational. The ECK operator correctly initiated a rolling restart but cannot complete the operation because DNS queries to Elasticsearch pod FQDNs consistently time out, preventing the operator from verifying cluster safety before proceeding.

Impact: 1 of 4 nodes marked not ready, rolling restart deadlocked for 11+ hours.

Root Cause: Kubernetes DNS infrastructure issue - not an ECK operator defect.


Investigation Findings

1. Elasticsearch Cluster Status (Healthy)

Cluster Health:

{
  "status": "green",
  "number_of_nodes": 4,
  "active_shards": 176,
  "unassigned_shards": 0,
  "active_shards_percent_as_number": 100.0
}

Node-3 Status:

  • Container running normally since 2025-10-16T02:36:43Z
  • Successfully joined cluster at 02:37:12Z
  • No errors in Elasticsearch logs
  • All cluster operations functioning correctly

Finding: Elasticsearch is healthy and not experiencing any internal issues.

2. ECK Operator Behavior (Working as Designed)

What the operator did correctly:

  1. Initiated rolling restart procedure
  2. Called Elasticsearch Shutdown API to prepare node-3
  3. Node-3 successfully completed shutdown preparation (migrated shards)
  4. Operator attempted to verify cluster health before proceeding
  5. DNS queries failed - operator correctly refused to proceed
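
These steps show up in the operator logs; a sketch for pulling the relevant entries, assuming the default install (StatefulSet elastic-operator in the elastic-system namespace):

# filter the operator log for connectivity and re-queue messages like the ones in the appendix
kubectl -n elastic-system logs statefulset/elastic-operator --tail=500 | grep -Ei 'i/o timeout|cannot be reached|re-queuing'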

Elasticsearch Resource Status:

status:
  availableNodes: 3
  phase: "ApplyingChanges"
  health: "unknown"
  conditions:
    - type: ElasticsearchIsReachable
      status: "False"
      message: "Service has endpoints but Elasticsearch is unavailable"

Finding: Operator is behaving correctly by not proceeding with potentially unsafe operations when it cannot verify cluster state.
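
The status block above can be re-checked at any time from the Elasticsearch custom resource, for example:

# printer columns show HEALTH / NODES / VERSION / PHASE at a glance
kubectl get elasticsearch quickstart
# the full status block, including the ElasticsearchIsReachable condition, is at the end of the resource
kubectl get elasticsearch quickstart -o yaml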

3. Node Shutdown State (Expected Behavior)

Shutdown API Status:

{
  "node_id": "PrRzhu4ASKad-wlj7tmX5Q",
  "type": "RESTART",
  "reason": "pre-stop hook",
  "status": "COMPLETE",
  "shard_migration": {"status": "COMPLETE"}
}
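
The same records can be read back through the Elasticsearch node shutdown API; a sketch using the elastic user secret that ECK creates for the quickstart cluster:

PW=$(kubectl get secret quickstart-es-elastic-user -o go-template='{{.data.elastic | base64decode}}')
# GET _nodes/shutdown lists any in-progress or completed shutdown registrations
kubectl exec quickstart-es-default-0 -c elasticsearch -- \
  curl -sk -u "elastic:$PW" "https://localhost:9200/_nodes/shutdown?pretty"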

Timeline:

  1. Operator marked node for restart (normal orchestration)
  2. Node prepared for shutdown - shards migrated successfully
  3. Operator attempted to verify cluster health before terminating pod
  4. DNS resolution failed - operator waiting for connectivity
  5. Readiness probe fails (expected - node in shutdown mode)

Finding: The shutdown preparation completed successfully. The operator is correctly waiting to verify cluster safety before proceeding with pod termination.

4. Root Cause: DNS Infrastructure Failure (External Issue)

Persistent DNS timeout errors from Oct 14-16:

Error: "dial tcp: lookup quickstart-es-default-[0-3].quickstart-es-default.default 
       on 172.20.0.10:53: read udp 10.200.0.253:XXXXX->172.20.0.10:53: i/o timeout"

DNS Configuration:

  • DNS Server: 172.20.0.10:53 (CoreDNS)
  • ECK Operator IP: 10.200.0.253
  • Target FQDNs: quickstart-es-default-{0,1,2,3}.quickstart-es-default.default

What's happening:

  1. Operator makes HTTPS request to Elasticsearch API
  2. Kubernetes attempts DNS lookup of pod FQDN
  3. DNS query times out (infrastructure failure)
  4. Operator cannot reach Elasticsearch API
  5. Operator cannot verify cluster is safe
  6. Operator correctly refuses to proceed

Finding: The DNS infrastructure (CoreDNS) is not responding to queries from the operator pod. This is a Kubernetes platform issue, not an ECK operator issue.
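
The failing lookup can be reproduced from a throwaway pod (the dns-test name and busybox image are just examples; the target FQDN and DNS server IP come from the errors above, assuming the default cluster.local cluster domain):

# query CoreDNS directly for one of the pod FQDNs behind the headless service
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 -- \
  nslookup quickstart-es-default-3.quickstart-es-default.default.svc.cluster.local 172.20.0.10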


Root Cause Analysis

The operator is not at fault. The ECK operator is working as designed:

  1. Correctly initiated rolling restart
  2. Properly used Elasticsearch Shutdown API
  3. Validated shutdown preparation completed
  4. Attempted to verify cluster health before proceeding
  5. Detected inability to reach Elasticsearch
  6. Appropriately refused to proceed with potentially unsafe operations
  7. Logged clear error messages indicating DNS resolution failure

The infrastructure is at fault:

Evidence of DNS Infrastructure Failure:

  • DNS timeout errors span multiple days (Oct 14-16)
  • Affects all Elasticsearch node FQDNs intermittently
  • Consistent pattern: queries to CoreDNS (172.20.0.10:53) timeout
  • Elasticsearch cluster remains healthy throughout (proving it's not an ES issue)
  • No ECK operator code defects or logic errors observed

Likely DNS Infrastructure Causes:

  • CoreDNS pods overloaded, unhealthy, or under-resourced
  • Network policy blocking DNS queries from operator namespace
  • CNI plugin issues causing packet loss between pods
  • DNS service disruption or rate limiting
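
A starting checklist for ruling these out (pod labels and the coredns deployment name follow the standard EKS add-on; adjust if the cluster uses something else):

# CoreDNS pod health, placement, and recent logs
kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide
kubectl -n kube-system logs -l k8s-app=kube-dns --tail=100
kubectl -n kube-system get deployment coredns
# any NetworkPolicies that could block port 53 traffic from the operator namespace
kubectl get networkpolicy -A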

Conclusion

This is definitively not an ECK operator bug. The operator is functioning correctly and exhibiting appropriate defensive behavior.

The ECK operator properly:

  • Uses Elasticsearch APIs according to best practices
  • Implements safe orchestration by verifying cluster health before destructive operations
  • Logs clear diagnostic information about connectivity failures
  • Refuses to proceed when it cannot guarantee safety

The actual problem is a Kubernetes DNS infrastructure failure where queries from the ECK operator pod to CoreDNS consistently time out. The operator has no control over DNS resolution - it correctly relies on Kubernetes platform services, which are currently failing.

Operator behavior is correct: It is safer for the operator to wait indefinitely than to proceed with pod termination when cluster health cannot be verified. This prevents potential data loss or cluster instability.

Fix Required: Resolve the DNS resolution failures in the Kubernetes cluster infrastructure. This is a platform/infrastructure issue, not an application-layer (ECK) issue.


Appendix: Error Samples

DNS Timeout from Operator (Oct 16, 02:42:52):

{
  "log.level": "error",
  "@timestamp": "2025-10-16T02:42:52.860Z",
  "error": "dial tcp: lookup quickstart-es-default-2.quickstart-es-default.default on 172.20.0.10:53: read udp 10.200.0.253:44991->172.20.0.10:53: i/o timeout"
}

Operator Correctly Detecting Unreachability:

{
  "log.level": "info",
  "@timestamp": "2025-10-16T02:43:31.933Z",
  "message": "Elasticsearch cannot be reached yet, re-queuing",
  "namespace": "default",
  "es_name": "quickstart"
}

These logs demonstrate the operator is working correctly: detecting the connectivity issue, logging it clearly, and re-queuing the reconciliation rather than proceeding unsafely.


AI-generated Diagnosis Based On: ECK Diagnostics Bundle (2025-10-16T13:49:06)

================
Diagnosis seems plausible to me. I don't think we have a bug here.
