Elastic pod is not Ready: Readiness probe failed: nc: connect to 127.0.0.1 port 8080 (tcp) failed: Connection refused

Hello everyone,

We recently had an upgrade of the K8s nodes, during which all pods were killed. Since the Elastic instances were deployed via ECK, the ECK operator automatically recreated the pods. However, the Elastic pod was marked as Running but not Ready:

# kubectl get pod elastic-main

NAME                       READY   STATUS    RESTARTS   AGE
main-es-master-0           0/1     Running   0          5d20h

There were no errors in the pod logs, but the pod description showed the following error:

# kubectl describe pod elastic-main

Events:
  Warning  Unhealthy               25s (x184 over 14m)  kubelet                  Readiness probe failed: nc: connect to 127.0.0.1 port 8080 (tcp) failed: Connection refused

While the connection to Elastic on port 9200 was successful, the connection to port 8080 failed from inside the container:

# kubectl exec -it elastic-main -- bash

es@main-es-master-0:/usr/share/elasticsearch$ nc -z -v -w5 127.0.0.1 9200
Connection to 127.0.0.1 9200 port [tcp/*] succeeded!

es@main-es-master-0:/usr/share/elasticsearch$ nc -z -v -w5 127.0.0.1 8080
nc: connect to 127.0.0.1 port 8080 (tcp) failed: Connection refused

and indeed there was no service listening on port 8080, while a correctly working Elastic node does have one (the one-liner below lists all listening TCP6 sockets, i.e. those in state 0A, converting the hex port to decimal):

es@main-es-master-0:/usr/share/elasticsearch$ cat /proc/net/tcp6 | grep " 0A " | awk '{print $2}' | cut -d: -f2 | xargs -I{} printf "%d\n" 0x{}

9200
9300

The cluster status was green though:

curl -u "admin:1qazXSW@" -k "https://localhost:9200/_cluster/health?filter_path=status,*_shards?pretty"
{"status":"green"}

There were no issues allocating shards:

curl -u "admin:1qazXSW@" -k "https://localhost:9200/_cluster/allocation/explain?pretty"
{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "No shard was specified in the request..."
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "No shard was specified in the request..."
  },
  "status" : 400
}

nor were there any problems with disk space (the 400 response above is expected when there are no unassigned shards for the allocation explain API to pick).
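
One way to double-check the disk side is the _cat allocation API, which reports disk.used, disk.avail and disk.percent per node (same credentials as above):

curl -u "admin:1qazXSW@" -k "https://localhost:9200/_cat/allocation?v"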

I found that the only thing that resolves the issue is removing Elastic along with its PVC and redeploying both from scratch.
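
For reference, that reset boils down to deleting the ECK-managed Elasticsearch resource, deleting the leftover data PVC so the recreated node starts from a clean volume, and re-applying the original manifest. Roughly (the resource name, PVC name and manifest filename below are just examples from my setup):

# kubectl delete elasticsearch main
# kubectl delete pvc elasticsearch-data-main-es-master-0
# kubectl apply -f elasticsearch-main.yaml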

I was able to reproduce the problem once by deploying Elastic, attempting to delete the Elastic pod, interrupting the deletion with CTRL+C, and then forcibly removing the pod with the --force flag. After this, ECK tried to recreate the pod, but it failed with the following error:

{"@timestamp":"2024-10-15T10:15:50.036Z", "log.level":"DEBUG", "message":"address [10.10.10.10:9300], node [unknown discovery result", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[feature-userstory-1733-es-master-0][generic][T#4]","log.logger":"org.elasticsearch.discovery.PeerFinder","elasticsearch.cluster.uuid":"PhXNclNITlux-nuW6ymzhA","elasticsearch.node.id":"hUKTwpuqSEiiZ8A8fWKUVA","elasticsearch.node.name":"feature-userstory-1733-es-master-0","elasticsearch.cluster.name":"feature-userstory-1733","error.type":"org.elasticsearch.transport.ConnectTransportException","error.message":"[][10.10.10.10:9300] connect_timeout[30s]","error.stack_trace":"org.elasticsearch.transport.ConnectTransportException: [][10.10.10.10:9300] connect_timeout[30s]\n\tat org.elasticsearch.server@8.15.0/org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onTimeout(TcpTransport.java:1150)\n\tat org.elasticsearch.server@8.15.0/org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:917)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)\n\tat java.base/java.lang.Thread.run(Thread.java:1570)\n"}

ECK tried to recreate it again and failed one more time.

Finally, on the third attempt, the pod was recreated successfully and Elastic health was green, but the generic "Readiness probe failed" message remained in the pod description and the pod stayed marked as "Not Ready".
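
For clarity, the reproduction sequence above roughly corresponds to the following commands (the pod name is the one from my test deployment):

# kubectl delete pod feature-userstory-1733-es-master-0
(interrupt with CTRL+C while the deletion is still in progress)
# kubectl delete pod feature-userstory-1733-es-master-0 --grace-period=0 --force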

Currently, I have worked around the issue with a custom readiness probe (though I know overriding it is not recommended for versions >8.2.0):

spec:
  containers:
  - name: elasticsearch
    readinessProbe:
      exec:
        command:
        - bash
        - -c
        - |
          curl -s -k https://localhost:9200 | grep -q "missing authentication credentials"
      failureThreshold: 3
      initialDelaySeconds: 10
      periodSeconds: 12
      successThreshold: 1
      timeoutSeconds: 12

which allows me to check port 9200 instead of 8080.
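
The check relies on the fact that an unauthenticated request to a security-enabled cluster returns a security_exception whose body contains "missing authentication credentials", so the probe only verifies that the HTTP layer on 9200 is up, without needing any credentials. It can be sanity-checked from inside the container:

es@main-es-master-0:/usr/share/elasticsearch$ curl -s -k https://localhost:9200 | grep -q "missing authentication credentials" && echo ready
ready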

Summary:

  • The Pod is marked as "Not Ready"
  • The error message in the pod description is uninformative and does not point to the actual cause.
  • No errors were found in the logs, even with debug enabled.
  • The pod is actually ready and responsive.

Details:

Elastic: 8.15.0
ECK: 2.13.0
Platform: OpenShift

Any insights or suggestions on how to address this issue would be greatly appreciated!

Thank you!

I am facing a similar issue, but in my case the Elasticsearch readiness probe is set as:
readinessProbe:
  exec:
    command:
    - bash
    - -c
    - /mnt/elastic-internal/scripts/readiness-port-script.sh
  failureThreshold: 3
  initialDelaySeconds: 10
  periodSeconds: 5
  successThreshold: 1
  timeoutSeconds: 5

and it passes when run manually:
elasticsearch@elasticsearch-es-default-0:~$ /mnt/elastic-internal/scripts/readiness-port-script.sh
Connection to 127.0.0.1 8080 port [tcp/*] succeeded!

while the cluster health is yellow, even though the probe only failed twice:

Warning Unhealthy 32m (x2 over 32m) kubelet Readiness probe failed: nc: connect to 127.0.0.1 port 8080 (tcp) failed: Connection refused

Version: 8.15.0 (eck-stack chart 0.12.1)