Hello everyone,
We recently upgraded our K8s nodes, during which all pods were killed. Since the Elastic instances are deployed via ECK, the ECK operator automatically recreated the pods. However, the Elastic pod was marked as Running but not Ready:
# kubectl get pod elastic-main
NAME READY STATUS RESTARTS AGE
main-es-master-0 0/1 Running 0 5d20h
There were no errors in the pod logs, but the pod description showed the following error:
# kubectl describe pod elastic-main
Events:
Warning Unhealthy 25s (x184 over 14m) kubelet Readiness probe failed: nc: connect to 127.0.0.1 port 8080 (tcp) failed: Connection refused
While the connection to Elastic on port 9200 was successful, the connection to port 8080 failed from inside the container:
# kubectl exec -it elastic-main -- bash
es@main-es-master-0:/usr/share/elasticsearch$ nc -z -v -w5 127.0.0.1 9200
Connection to 127.0.0.1 9200 port [tcp/*] succeeded!
es@main-es-master-0:/usr/share/elasticsearch$ nc -z -v -w5 127.0.0.1 8080
nc: connect to 127.0.0.1 port 8080 (tcp) failed: Connection refused
and indeed nothing was listening on port 8080 inside the container (on a healthy Elastic pod there is):
es@main-es-master-0:/usr/share/elasticsearch$ cat /proc/net/tcp6 | grep " 0A " | awk '{print $2}' | cut -d: -f2 | xargs -I{} printf "%d\n" 0x{}
9200
9300
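As a side note, the hex-decoding pipeline above can be wrapped in a small helper. It is pure text processing, so it can be exercised with a fabricated /proc/net/tcp6-style line (0x23F0 = 9200):

```shell
# Print the decimal ports in LISTEN state ("0A") from /proc/net/tcp6-style input.
decode_ports() {
  grep " 0A " | awk '{print $2}' | cut -d: -f2 | sort -u \
    | while read -r hex; do printf "%d\n" "0x$hex"; done
}

# Fabricated sample line (only the sl, local, remote, and state fields):
printf '1: 00000000000000000000000000000000:23F0 00000000000000000000000000000000:0000 0A rest\n' | decode_ports
# → 9200
```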
The cluster status was green though:
curl -u "admin:1qazXSW@" -k "https://localhost:9200/_cluster/health?filter_path=status,*_shards&pretty"
{"status":"green"}
There were no shard allocation issues either:
curl -u "admin:1qazXSW@" -k "https://localhost:9200/_cluster/allocation/explain?pretty"
{
"error" : {
"root_cause" : [
{
"type" : "illegal_argument_exception",
"reason" : "No shard was specified in the request..."
}
],
"type" : "illegal_argument_exception",
"reason" : "No shard was specified in the request..."
},
"status" : 400
}
nor any problems with disk space.
I found that only removing Elastic along with its PVC and redeploying both from scratch resolves the issue.
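For completeness, the teardown-and-redeploy sequence looks roughly like the sketch below. The resource name `main`, the label selector, and the manifest filename are assumptions for my setup; the `run` helper only echoes each command so the sketch is side-effect free (drop it to execute for real).

```shell
# Teardown-and-redeploy sketch. Resource name, label selector, and manifest
# filename are assumptions; 'run' echoes each command instead of executing it.
run() { echo "$@"; }

run kubectl delete elasticsearch main                                     # delete the ECK-managed cluster
run kubectl delete pvc -l elasticsearch.k8s.elastic.co/cluster-name=main  # delete its PVCs
run kubectl apply -f elasticsearch.yaml                                   # redeploy from scratch
```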
I was able to reproduce the problem once by deploying Elastic, starting a deletion of the Elastic pod, interrupting the deletion with CTRL+C, and then forcibly removing the pod with the --force flag. After this, ECK tried to recreate the pod, but it failed with the following error:
{"@timestamp":"2024-10-15T10:15:50.036Z", "log.level":"DEBUG", "message":"address [10.10.10.10:9300], node [unknown discovery result", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[feature-userstory-1733-es-master-0][generic][T#4]","log.logger":"org.elasticsearch.discovery.PeerFinder","elasticsearch.cluster.uuid":"PhXNclNITlux-nuW6ymzhA","elasticsearch.node.id":"hUKTwpuqSEiiZ8A8fWKUVA","elasticsearch.node.name":"feature-userstory-1733-es-master-0","elasticsearch.cluster.name":"feature-userstory-1733","error.type":"org.elasticsearch.transport.ConnectTransportException","error.message":"[][10.10.10.10:9300] connect_timeout[30s]","error.stack_trace":"org.elasticsearch.transport.ConnectTransportException: [][10.10.10.10:9300] connect_timeout[30s]\n\tat org.elasticsearch.server@8.15.0/org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onTimeout(TcpTransport.java:1150)\n\tat org.elasticsearch.server@8.15.0/org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:917)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)\n\tat java.base/java.lang.Thread.run(Thread.java:1570)\n"}
ECK tried to recreate the pod again and failed one more time.
Finally, on the third attempt, the pod was recreated successfully and the Elastic health was green, but the generic "Readiness probe failed" message remained in the pod description and the pod stayed marked as "Not Ready".
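For anyone trying to reproduce this, the force-removal step can be sketched as below (pod name is an assumption; the `kubectl` stub makes the sketch print the commands instead of running them against a cluster):

```shell
# Repro sketch. The stub below prints each command instead of executing it;
# remove the stub to run against a real cluster.
kubectl() { echo "kubectl $*"; }

kubectl delete pod main-es-master-0                           # start a graceful delete, then hit CTRL+C
kubectl delete pod main-es-master-0 --grace-period=0 --force  # force-remove the half-deleted pod
```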
For now, I have worked around the issue with a custom readiness probe (though I know this is not recommended for versions >8.2.0):
spec:
  containers:
  - name: elasticsearch
    readinessProbe:
      exec:
        command:
        - bash
        - -c
        - |
          curl -s -k https://localhost:9200 | grep -q "missing authentication credentials"
      failureThreshold: 3
      initialDelaySeconds: 10
      periodSeconds: 12
      successThreshold: 1
      timeoutSeconds: 12
which lets me check readiness on port 9200 instead of 8080.
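The probe works because an unauthenticated request to a security-enabled Elasticsearch returns a 401 body mentioning missing credentials. The canned response below (fabricated for illustration) stands in for the `curl -s -k https://localhost:9200` call, so the grep logic can be checked offline:

```shell
# Canned 401 body standing in for `curl -s -k https://localhost:9200`
# (fabricated for illustration).
RESP='{"error":{"reason":"missing authentication credentials for REST request [/]"},"status":401}'

if printf '%s' "$RESP" | grep -q "missing authentication credentials"; then
  echo "ready"       # probe succeeds: Elasticsearch answered on 9200
else
  echo "not ready"   # probe fails: no (or unexpected) response
fi
```

This prints "ready", mirroring what the probe sees when the node is actually serving requests.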
Summary:
- The Pod is marked as "Not Ready"
- The error message in the pod description is uninformative and does not clarify the misconfiguration.
- No errors were found in the logs, even with debug enabled.
- The pod is actually ready and responsive.
Details:
Elastic: 8.15.0
ECK: 2.13.0
Platform: Openshift
Any insights or suggestions on how to address this issue would be greatly appreciated!
Thank you!