Elasticsearch stuck with cluster_uuid NA

We have a production Elasticsearch cluster managed by operator version 1.0.1. We planned to upgrade to the latest operator version, 1.5, so that we can use the dynamic storage (volume expansion) changes.
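To be clear about what we mean by dynamic storage, this is roughly the workflow we are after once the operator is on 1.5 (a minimal sketch, not our actual manifests; the StorageClass name standard and the Elasticsearch resource name es are assumptions):

# Make sure the StorageClass allows expansion (assumed name: standard)
kubectl patch storageclass standard -p '{"allowVolumeExpansion": true}'

# With operator 1.5, increasing the storage request in the volumeClaimTemplates
# of the Elasticsearch resource should then resize the PVCs in place
kubectl edit elasticsearch es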

When we ran the operator upgrade, pods in the cluster started being recreated. The first pod was recreated with the new CRD changes and came up reporting "cluster_uuid" : "_na_".

There are 5 pods in the cluster [es-0 to es-4].
3 of them are master-eligible.

Currently, pod es-4 has been recreated without a cluster UUID, so it is not able to join the existing cluster.

Any suggestions here?

Logs:

[WARN ][o.e.c.c.ClusterFormationFailureHelper] [es-4] master not discovered or elected yet, an election requires at least 3 nodes with ids from [h_EWLH6LQ_uyGXccS180EA, lnygcbVsTcW-iP9tYokRyg, D-EpZDHvRmWYPkYoCmKZOg, Uzm73zuMS1yaRLzPwEm_Bw, 8rVHfnlqTw-o-c_EZwExWA], have discovered [{es-4}{h_EWLH6LQ_uyGXccS180EA}{cn_NslDWSqaNLUD-LZYEnA}{1X.X.X.X37.18}{1X.X.X.X37.18:9300}{cdhilmrstw}{k8s_node_name=gke-node-4868d10d-h4ws, ml.machine_memory=X.X.X.X6736, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}] which is not a quorum; discovery will continue using [X.X.X.X.1:9300, X.X.X.X.1:9301, X.X.X.X.1:9302, X.X.X.X.1:9303, X.X.X.X.1:9304, X.X.X.X.1:9305, 1X.X.X.X3X.X.X.X0, 1X.X.X.X41.18:9300, 1X.X.X.X43.20:9300, 1X.X.X.X58.22:9300] from hosts providers and [{es-4}{h_EWLH6LQ_uyGXccS180EA}{cn_NslDWSqaNLUD-LZYEnA}{1X.X.X.X37.18}{1X.X.X.X37.18:9300}{cdhilmrstw}{k8s_node_name=gke-node-4868d10d-h4ws, ml.machine_memory=X.X.X.X6736, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}] from last-known cluster state; node term 133, last-accepted version 53845 in term 133

Cluster metadata:

# curl -X GET "localhost:9200"
{
  "name" : "es-4",
  "cluster_name" : "es",
  "cluster_uuid" : "_na_",
  "version" : {
    "number" : "7.10.2",
    "build_flavor" : "default",
    "build_type" : "docker",
    "build_hash" : "747e1cc715sHKXqdef077253878a59143c1f785afa92b9",
    "build_date" : "2021-01-13T00:42:12.435326Z",
    "build_snapshot" : false,
    "lucene_version" : "8.7.0",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}
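For reference, the voting configuration from the log above can be cross-checked from a pod that is still part of the cluster, using the same kind of curl as above (a sketch; it assumes the HTTP layer is reachable the same way as in the output above):

# curl -s "localhost:9200/_cat/nodes?v"
# curl -s "localhost:9200/_cluster/state?filter_path=metadata.cluster_coordination"

The node IDs under metadata.cluster_coordination.last_committed_config should line up with the IDs that the es-4 log says are required for an election.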

This node cannot discover any other nodes at the addresses provided. Either the addresses are wrong, or else there's a connectivity problem.

@DavidTurner
The IP addresses that es-4 is trying to connect to are correct, and there are no connectivity problems.

Those IPs belong to the nodes in the ES cluster.

FYI: the es-{0-4} pods are managed by a StatefulSet.

We have 5 pods, es-{0-4}, in the ES cluster. The Elasticsearch operator (version 1.0) is managing them.
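For reference, the StatefulSet and pods can be listed through the ECK labels (a sketch; the cluster name es in the selector is assumed from the cluster_name in the output above):

kubectl get statefulset,pods -l elasticsearch.k8s.elastic.co/cluster-name=es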

The latest version is 1.5.
Ref Upgrade ECK | Elastic Cloud on Kubernetes [1.5] | Elastic

As part of the operator upgrade, we changed the operator version and deployed it (roughly as in the sketch below).
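For clarity, the deployment step was essentially the following (a sketch; the exact 1.5.0 manifest URL is taken from the ECK docs and may not match what we actually applied):

# Apply the 1.5.0 operator manifest over the existing 1.0.x installation
kubectl apply -f https://download.elastic.co/downloads/eck/1.5.0/all-in-one.yaml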

The operator started recreating pods. First it recreated es-4, which came back up but was not able to connect to the existing cluster. There are no network connectivity issues.

To test the changes, I tried the same upgrade in our test environment. The operator recreated es-4, and I then had to delete 2 more nodes/pods [es-3, es-2] so that they would be recreated [to satisfy the quorum requirement] and form a cluster.
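Concretely, the manual step in the test environment was just deleting the pods so the StatefulSet would bring them back with the new spec (a sketch; the bare pod names es-3 and es-2 follow the naming used in this thread, the real ECK pod names may carry a nodeSet prefix):

# Delete the stuck pods so they are recreated and can rejoin
kubectl delete pod es-3 es-2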

The rest of the nodes/pods, es-1 and es-0, were taken care of by the operator; they were recreated and rejoined without any manual deletions.

Is that expected behavior? Do we need to manually kill/delete the pods so they can rejoin the cluster?

This seems like a contradiction?

To be more specific: es-4 was recreated successfully but is not able to join the cluster.

Any suggestions on how to fix this?

I tried upgrading the Elasticsearch operator from version 1.0 to 1.5.

As you said, the node is not able to connect to the existing cluster. That's a connectivity problem.

You could try setting logger.org.elasticsearch.discovery: DEBUG to expose the low-level exceptions that Elasticsearch is seeing, but they normally don't tell us much more than we already know here.
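For anyone following along, a minimal sketch of how that setting could be applied; using the cluster settings API on a joined node, and the config block for the un-joined one, are assumptions about this particular setup rather than anything from the thread:

# On a node that has already joined the cluster, raise discovery logging dynamically
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": { "logger.org.elasticsearch.discovery": "DEBUG" }
}'

# For es-4, which has no formed cluster yet, the setting would instead have to go into
# elasticsearch.yml, e.g. through the config section of the nodeSet in the Elasticsearch manifest.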
