Hi team,
We are facing a strange issue: when we set the master node count in the Elasticsearch custom resource to more than 1, the master nodes are not able to elect a master. We have tried odd counts up to 11 and master election still does not happen. Logs from the master pod:
{"type": "server", "timestamp": "2020-06-20T09:25:59,336Z", "level": "WARN", "component": "o.e.c.c.ClusterFormationFailureHelper", "cluster.name": "elasticsearch-config", "
node.name": "elasticsearch-config-es-master-0", "message": "master not discovered yet, this node has not previously joined a bootstrapped (v7+) cluster, and this node must
discover master-eligible nodes [elasticsearch-config-es-master-0, elasticsearch-config-es-master-1, elasticsearch-config-es-master-2, elasticsearch-config-es-master-3, elas
ticsearch-config-es-master-4] to bootstrap a cluster: have discovered [{elasticsearch-config-es-master-0}{zHswPo-WT6uH0ZrEt4tgdQ}{FNfaLOFVQGCEWEj4vLG1mA}{10.124.15.227}{10.
124.15.227:9300}{lm}{ml.machine_memory=3221225472, xpack.installed=true, ml.max_open_jobs=20}]; discovery will continue using [127.0.0.1:9300, 127.0.0.1:9301, 127.0.0.1:930
2, 127.0.0.1:9303, 127.0.0.1:9304, 127.0.0.1:9305, 10.124.14.144:9300, 10.124.24.146:9300, 10.124.3.180:9300, 10.124.6.67:9300] from hosts providers and [{elasticsearch-con
fig-es-master-0}{zHswPo-WT6uH0ZrEt4tgdQ}{FNfaLOFVQGCEWEj4vLG1mA}{10.124.15.227}{10.124.15.227:9300}{lm}{ml.machine_memory=3221225472, xpack.installed=true, ml.max_open_jobs
=20}] from last-known cluster state; node term 0, last-accepted version 0 in term 0" }
This seems strange, as we are using the same config to deploy in multiple namespaces of the cluster and it was working fine; the issue suddenly appeared out of nowhere.
The only change to the GKE cluster is that it was patched from 1.15 to 1.16. Could that cause an issue?
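For reference, the master nodeSet in our Elasticsearch custom resource looks roughly like this (a rough sketch, not the exact manifest; the version and role settings are placeholders, while the name, namespace, count and label match our setup):

apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: elasticsearch-config
  namespace: test-logstash-1
spec:
  version: 7.7.1            # placeholder, we run a 7.x version
  nodeSets:
  - name: master
    count: 5                # the count we increase beyond 1
    config:
      node.master: true
      node.data: false
    podTemplate:
      metadata:
        labels:
          master: node      # label used in the kubectl selectors below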
One more observation from troubleshooting:
The service used for communication between the master pods, elasticsearch-config-es-transport, has internal endpoints associated with the pods:
Name: elasticsearch-config-es-transport
Namespace: test-logstash-1
Labels: common.k8s.elastic.co/type=elasticsearch
elasticsearch.k8s.elastic.co/cluster-name=elasticsearch-config
Annotations: <none>
Selector: common.k8s.elastic.co/type=elasticsearch,elasticsearch.k8s.elastic.co/cluster-name=elasticsearch-config
Type: ClusterIP
IP: None
Port: <unset> 9300/TCP
TargetPort: 9300/TCP
Endpoints: 10.124.14.144:9300,10.124.15.227:9300,10.124.24.146:9300 + 5 more...
Session Affinity: None
Events: <none>
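(The output above is from a describe of the transport service, roughly like this, using the names from our setup; the endpoints object can be listed the same way:)
kubectl describe svc elasticsearch-config-es-transport -n test-logstash-1
kubectl get endpoints elasticsearch-config-es-transport -n test-logstash-1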
I am trying to nc from one master pod to another and I am getting connection refused.
From pod 2 to pod 1, i.e. master 2 to master 1:
nc -vz 10.124.14.144 9300
connection refused
I am wondering how networking is affected, as these pods are in the same namespace and on the same network; I am able to ping, though.
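The same check can be scripted from the jumpbox against every transport endpoint in one go (a sketch; it assumes nc is available in the image being exec'd into, as in the manual check above):

for ip in 10.124.15.227 10.124.6.67 10.124.3.180 10.124.24.146 10.124.14.144; do
  kubectl exec -n test-logstash-1 elasticsearch-config-es-master-1 -- nc -vz -w 2 "$ip" 9300
done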
Some more troubleshooting:
- Internal IPs of the master pods:
kk8soptr@jumpbox:~$ kubectl get pods -l master=node -n test-logstash-1 -o go-template='{{range .items}}{{.status.podIP}}{{"\n"}}{{end}}'
10.124.15.227
10.124.6.67
10.124.3.180
10.124.24.146
10.124.14.144
- Testing from within the pods:
[root@logstash-0 logstash]# for ep in 10.124.15.227:9300 10.124.6.67:9300 10.124.3.180:9300 10.124.24.146:9300 10.124.14.144:9300; do
> wget -qO- $ep
> done
no response
[root@elasticsearch-config-es-master-0 elasticsearch]# for ep in 10.124.15.227:9300 10.124.6.67:9300 10.124.3.180:9300 10.124.24.146:9300 10.124.14.144:9300; do wget -qO- $ep; done
no response
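Since 9300 is the transport port (and, as far as I understand, ECK enables TLS on the transport layer by default), a plain HTTP wget may print nothing even when the port is reachable, so a raw TCP connect check might be more telling. A minimal sketch, assuming bash with /dev/tcp and the coreutils timeout command are available in the container:

for ip in 10.124.15.227 10.124.6.67 10.124.3.180 10.124.24.146 10.124.14.144; do
  timeout 2 bash -c "</dev/tcp/$ip/9300" && echo "$ip:9300 open" || echo "$ip:9300 closed"
done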
Thanks in advance