ECK - "master not discovered yet, this node has not previously joined a bootstrapped (v7+) cluster"

Hi team,

We are facing a strange issue: when we set the master node count in the Elasticsearch custom resource to more than 1, the master nodes are not able to elect a master. We have tried odd counts up to 11 and the election still never happens. Logs from the master pod:

{"type": "server", "timestamp": "2020-06-20T09:25:59,336Z", "level": "WARN", "component": "o.e.c.c.ClusterFormationFailureHelper", "cluster.name": "elasticsearch-config", "node.name": "elasticsearch-config-es-master-0", "message": "master not discovered yet, this node has not previously joined a bootstrapped (v7+) cluster, and this node must discover master-eligible nodes [elasticsearch-config-es-master-0, elasticsearch-config-es-master-1, elasticsearch-config-es-master-2, elasticsearch-config-es-master-3, elasticsearch-config-es-master-4] to bootstrap a cluster: have discovered [{elasticsearch-config-es-master-0}{zHswPo-WT6uH0ZrEt4tgdQ}{FNfaLOFVQGCEWEj4vLG1mA}{10.124.15.227}{10.124.15.227:9300}{lm}{ml.machine_memory=3221225472, xpack.installed=true, ml.max_open_jobs=20}]; discovery will continue using [127.0.0.1:9300, 127.0.0.1:9301, 127.0.0.1:9302, 127.0.0.1:9303, 127.0.0.1:9304, 127.0.0.1:9305, 10.124.14.144:9300, 10.124.24.146:9300, 10.124.3.180:9300, 10.124.6.67:9300] from hosts providers and [{elasticsearch-config-es-master-0}{zHswPo-WT6uH0ZrEt4tgdQ}{FNfaLOFVQGCEWEj4vLG1mA}{10.124.15.227}{10.124.15.227:9300}{lm}{ml.machine_memory=3221225472, xpack.installed=true, ml.max_open_jobs=20}] from last-known cluster state; node term 0, last-accepted version 0 in term 0" }
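For reference, the custom resource in question looks roughly like this (a minimal sketch, assuming the ECK v1 API; the cluster name matches the log above, everything else is illustrative):

```yaml
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: elasticsearch-config
spec:
  version: 7.7.0        # illustrative; any 7.x shows the same bootstrap behaviour
  nodeSets:
  - name: master
    count: 5            # all 5 master-eligible nodes must discover each other to bootstrap
    config:
      node.master: true
      node.data: false
```

The log message reflects this: with count 5, initial cluster bootstrapping will not proceed until all five master-eligible pods can reach each other on port 9300.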

It seems strange, as we are using the same config to deploy into multiple namespaces in this cluster and it was working fine; suddenly this issue came out of nowhere.

The only change to the GKE cluster is that it was patched from 1.15 to 1.16. Could that cause this issue?

One more observation after troubleshooting:

the headless service used for transport between the master pods, elasticsearch-config-es-transport, has the pods' internal endpoints associated with it:

Name:              elasticsearch-config-es-transport
Namespace:         test-logstash-1
Labels:            common.k8s.elastic.co/type=elasticsearch
                   elasticsearch.k8s.elastic.co/cluster-name=elasticsearch-config
Annotations:       <none>
Selector:          common.k8s.elastic.co/type=elasticsearch,elasticsearch.k8s.elastic.co/cluster-name=elasticsearch-config
Type:              ClusterIP
IP:                None
Port:              <unset>  9300/TCP
TargetPort:        9300/TCP
Endpoints:         10.124.14.144:9300,10.124.15.227:9300,10.124.24.146:9300 + 5 more...
Session Affinity:  None
Events:            <none>

I am trying to nc from one pod to the other pods and I am getting connection refused:

from pod 2 to pod 1, i.e. master 2 to master 1:
nc -vz 10.124.14.144 9300
connection refused

I am wondering how networking can be affected, as these pods are in the same namespace and on the same network; ping does work between them, though.

Some more troubleshooting:

  1. internal IPs of the master pods:
    kk8soptr@jumpbox:~$ kubectl get pods -l master=node -n test-logstash-1 -o go-template='{{range .items}}{{.status.podIP}}{{"\n"}}{{end}}'
    10.124.15.227
    10.124.6.67
    10.124.3.180
    10.124.24.146
    10.124.14.144
  2. testing from within the pods:
   [root@logstash-0 logstash]#  for ep in 10.124.15.227:9300 10.124.6.67:9300 10.124.3.180:9300 10.124.24.146:9300 10.124.14.144:9300; do
   >     wget -qO- $ep
   > done
   no response
   [root@elasticsearch-config-es-master-0 elasticsearch]#  for ep in 10.124.15.227:9300 10.124.6.67:9300 10.124.3.180:9300 10.124.24.146:9300 10.124.14.144:9300; do wget -qO- $ep; done
   no response
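As an aside, wget is not conclusive here: port 9300 speaks Elasticsearch's binary transport protocol, not HTTP, so "no response" can occur even against a healthy node. A raw TCP probe distinguishes an open port from a refused or filtered one. A minimal sketch (the probe helper is hypothetical; it relies on bash's /dev/tcp, so no nc or curl is needed inside the container):

```shell
#!/usr/bin/env bash
# probe HOST PORT: report whether a raw TCP connection succeeds.
probe() {
  local host=$1 port=$2
  # open fd 3 to host:port inside a short-lived bash; refused/filtered -> nonzero
  if timeout 1 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "${host}:${port} open"
  else
    echo "${host}:${port} closed"
  fi
}

# check every master pod's transport port (IPs from the list above)
for ip in 10.124.15.227 10.124.6.67 10.124.3.180 10.124.24.146 10.124.14.144; do
  probe "$ip" 9300
done
```

"open" for the pod's own IP but "closed" for all the others would point firmly at something between the nodes rather than at Elasticsearch itself.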

Thanks in advance

I am looking into it further and it seems to be a GKE cluster issue. Using nodeAffinity I can comfortably spin up all the master pods on a single node, but without node affinity the pods cannot communicate on port 9300 because they are scheduled onto different nodes of the node pool. It seems some firewall is blocking connections between nodes in the same node pool.
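If it really is a VPC firewall problem, an explicit allow rule for the transport port between nodes should confirm it. A sketch (the rule name, network name, and source range below are placeholders, not values from our cluster; GKE normally creates an automatic intra-cluster allow rule, so it is worth checking whether that rule survived the 1.15 -> 1.16 patch before adding anything):

```shell
# Inspect existing rules; look for GKE's auto-created "gke-<cluster>-...-all"
# rule, which normally permits pod-to-pod traffic inside the cluster.
gcloud compute firewall-rules list

# If it is missing, explicitly allow the Elasticsearch transport port
# between nodes. YOUR_VPC_NETWORK and POD_CIDR are placeholders.
gcloud compute firewall-rules create allow-es-transport \
  --network=YOUR_VPC_NETWORK \
  --allow=tcp:9300 \
  --source-ranges=POD_CIDR
```

If the nc test between pods succeeds after the rule is added, that would confirm the firewall theory.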