Elasticsearch stuck with cluster_uuid NA

We have a production Elasticsearch cluster managed by operator version 1.0.1. We planned to upgrade the operator to the latest version, 1.5, so that we can take advantage of the dynamic storage changes.

When we ran the operator upgrade, the pods in the cluster started getting recreated. The first pod was recreated with the new CRD changes and came up reporting "cluster_uuid" : "_na_".

The cluster has 5 pods [es-0 to es-4].
Three of them are master-eligible.

Currently, pod es-4 has been recreated without a cluster UUID, so it is not able to join the existing cluster.

Any suggestions here?

Logs:

[WARN ][o.e.c.c.ClusterFormationFailureHelper] [es-4] master not discovered or elected yet, an election requires at least 3 nodes with ids from [h_EWLH6LQ_uyGXccS180EA, lnygcbVsTcW-iP9tYokRyg, D-EpZDHvRmWYPkYoCmKZOg, Uzm73zuMS1yaRLzPwEm_Bw, 8rVHfnlqTw-o-c_EZwExWA], have discovered [{es-4}{h_EWLH6LQ_uyGXccS180EA}{cn_NslDWSqaNLUD-LZYEnA}{1X.X.X.X37.18}{1X.X.X.X37.18:9300}{cdhilmrstw}{k8s_node_name=gke-node-4868d10d-h4ws, ml.machine_memory=X.X.X.X6736, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}] which is not a quorum; discovery will continue using [X.X.X.X.1:9300, X.X.X.X.1:9301, X.X.X.X.1:9302, X.X.X.X.1:9303, X.X.X.X.1:9304, X.X.X.X.1:9305, 1X.X.X.X3X.X.X.X0, 1X.X.X.X41.18:9300, 1X.X.X.X43.20:9300, 1X.X.X.X58.22:9300] from hosts providers and [{es-4}{h_EWLH6LQ_uyGXccS180EA}{cn_NslDWSqaNLUD-LZYEnA}{1X.X.X.X37.18}{1X.X.X.X37.18:9300}{cdhilmrstw}{k8s_node_name=gke-node-4868d10d-h4ws, ml.machine_memory=X.X.X.X6736, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}] from last-known cluster state; node term 133, last-accepted version 53845 in term 133
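
The warning shows es-4 probing the transport addresses it got from the hosts providers but only discovering itself. A quick way to cross-check those addresses against the live pods is shown below; the label selector and the es-es-transport service name follow the usual ECK conventions for a cluster named "es" and are assumptions to adjust:

kubectl get pods -l elasticsearch.k8s.elastic.co/cluster-name=es -o wide
kubectl get endpoints es-es-transport -o wide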

Cluster metadata:

# curl -X GET "localhost:9200"
{
  "name" : "es-4",
  "cluster_name" : "es",
  "cluster_uuid" : "_na_",
  "version" : {
    "number" : "7.10.2",
    "build_flavor" : "default",
    "build_type" : "docker",
    "build_hash" : "747e1cc715sHKXqdef077253878a59143c1f785afa92b9",
    "build_date" : "2021-01-13T00:42:12.435326Z",
    "build_snapshot" : false,
    "lucene_version" : "8.7.0",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}
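
For comparison, the same request against one of the healthy pods should return the real cluster UUID rather than "_na_". A minimal check, assuming plain HTTP on localhost as in the curl above (adjust if TLS/auth are enabled):

kubectl exec es-0 -- curl -s "localhost:9200"
kubectl exec es-0 -- curl -s "localhost:9200/_cluster/health?pretty"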

This node cannot discover any other nodes at the addresses provided. Either the addresses are wrong, or else there's a connectivity problem.

@DavidTurner
The IP addresses that es-4 is trying to connect to are correct, and there are no connectivity problems.

Those IPs belong to the nodes in the ES cluster.
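
To rule connectivity in or out more explicitly, here is a rough TCP check of the transport port from inside es-4; the target address is a placeholder for one of the IPs in the log, and the assumption that bash and timeout exist in the image is worth verifying:

kubectl exec es-4 -- bash -c 'timeout 5 bash -c "echo > /dev/tcp/<other-node-ip>/9300" && echo reachable || echo unreachable'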

FYI: es-{0-4} are StatefulSet pods.

We have 5 pods, es-{0-4}, in the ES cluster, managed by version 1.0 of the Elasticsearch operator.

The latest version is 1.5.
Ref Upgrade ECK | Elastic Cloud on Kubernetes [1.5] | Elastic

As part of the operator upgrade, we have changed the operator version and deployed it.
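
For reference, the upgrade itself was roughly the following, per the linked upgrade guide; the exact manifest URL is the one from that page and worth double-checking:

kubectl apply -f https://download.elastic.co/downloads/eck/1.5.0/all-in-one.yaml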

The operator started recreating the pods. It started with es-4, which was recreated but is not able to connect to the existing cluster. There are no network connectivity issues.

To test the changes, I tried the same upgrade in our test environment. The operator recreated es-4, and I had to delete two more nodes/pods [es-3, es-2] so they would be recreated [to satisfy the quorum requirement] and form a cluster.
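
In the test env this amounted to something like the following, using the pod names as they appear in our cluster (the exact names depend on how the nodeSet is defined, so treat them as placeholders):

kubectl delete pod es-3 es-2
# wait for the StatefulSet/operator to bring them back, then check membership:
kubectl exec es-0 -- curl -s "localhost:9200/_cat/nodes?v"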

The remaining nodes/pods, es-1 and es-0, were taken care of by the operator; they were recreated and rejoined without any manual deletions.

Is that expected behavior? Do we need to manually kill/delete the pods so that they can rejoin the cluster?

This seems like a contradiction?

To be more specific: es-4 was able to be recreated, but it is not able to join the cluster.

Any suggestions on how to fix this?

I tried upgrading the Elasticsearch operator from version 1.0 to 1.5.

As you said, the node is not able to connect to the existing cluster. That's a connectivity problem.

You could try setting logger.org.elasticsearch.discovery: DEBUG to expose the low-level exceptions that Elasticsearch is seeing, but they normally don't tell us much more than we already know here.
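
Since es-4 has not joined a cluster, this cannot go through the dynamic cluster settings API on that node; it would have to be set in the node configuration, i.e. the config section of the nodeSet in the Elasticsearch manifest. A minimal sketch, assuming the resource is named es with a single nodeSet called default (merge it into the existing spec rather than applying it verbatim):

apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: es
spec:
  version: 7.10.2
  nodeSets:
  - name: default
    count: 5
    config:
      # hypothetical placement of the suggested setting
      logger.org.elasticsearch.discovery: DEBUG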