Elastic Master Nodes not discovered exception on ECK

Hi,

We have a 2-node cluster in which each node has the master, data, ingest and ml roles. This ECK setup is deployed on Oracle Container Engine for Kubernetes (OKE).

Recently, we were trying to upgrade the Kubernetes version on the OKE nodes, since Oracle has dropped support for older minor versions.

While doing this upgrade, we first moved elastic-operator to one of the new nodes. After that, we had to move the data nodes. Following the "Host Failure" topic on the "Storage Recommendations" page of the ECK documentation, we deleted only one Elasticsearch Pod (out of 2) and its PVC. We expected the data from the old PV to be transferred to the new PV. However, we noticed that both Elasticsearch nodes were failing with the error below.

"message": "master not discovered yet, this node has not previously joined a bootstrapped (v7+) cluster, and [cluster.initial_master_nodes] is empty on this node

{"type": "server", "timestamp": "2022-05-30T09:45:22,554Z", "level": "WARN", "component": "r.suppressed", "cluster.name": "quickstart", "node.name": "quickstart-es-nodeset-1", "message": "path: /_cluster/health, params: {}", "stacktrace": ["org.elasticsearch.discovery.MasterNotDiscoveredException: null"...

Because of this, there's a chance of losing data (which we definitely want to avoid). We are looking for suggestions on the following:

  1. How can we recover these master nodes and restart Elasticsearch?
  2. How can we upgrade the Kubernetes version of the nodes so that a restart of elastic-operator or the master nodes does not affect the state of Elasticsearch?

Hello @adityasinghal26

Welcome to the Elastic community!

First of all, if you are upgrading your Kubernetes provider (OKE), you don't have to do anything on the ECK side. Behind the scenes, the upgrade drains the node and performs a rolling upgrade: the StatefulSet recreates the Pod on another node while the first one is being upgraded, and the corresponding PVC is re-attached on the new node using the existing PV.

Now, if you delete the PV, there's no way to recover the data or the disk, because you just deleted it. There's also no way to delete the PVC while the Elasticsearch Pod is still running.

I would suggest you use at least 3 Elasticsearch Pods, so that you have at least 3 master-eligible nodes. Please see https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-discovery-quorums.html
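To illustrate why 2 master-eligible nodes are not enough, here is a minimal sketch of plain majority arithmetic. The function names are hypothetical, and it deliberately ignores how Elasticsearch adjusts its voting configuration at runtime, so treat it as a rule of thumb rather than a description of the actual election logic.

```python
def quorum(master_eligible: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return master_eligible // 2 + 1


def tolerated_failures(master_eligible: int) -> int:
    """How many master-eligible nodes can be lost while a majority remains."""
    return master_eligible - quorum(master_eligible)


# Compare a few cluster sizes, including the 2-node setup from this thread.
for n in (1, 2, 3, 5):
    print(
        f"{n} master-eligible node(s): "
        f"quorum={quorum(n)}, tolerated failures={tolerated_failures(n)}"
    )
```

With this arithmetic, a 2-node cluster cannot safely lose either node, while a 3-node cluster can lose one.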

Remember that you should always consider taking an Elasticsearch snapshot before you perform any kind of upgrade. In case of failure, you can easily restore from it.
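As a rough sketch of what taking that snapshot could look like, the snippet below builds a request for the create snapshot API (`PUT /_snapshot/<repository>/<snapshot>`). The base URL, repository name and snapshot name are placeholders I made up, the repository is assumed to be registered already, and authentication and TLS settings (which an ECK cluster will require) are left out on purpose.

```python
import json
import urllib.request


def build_snapshot_request(base_url: str, repository: str, snapshot: str) -> urllib.request.Request:
    """Build (but do not send) a create-snapshot request for all indices."""
    url = f"{base_url}/_snapshot/{repository}/{snapshot}?wait_for_completion=true"
    body = json.dumps({"indices": "*", "include_global_state": True}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Placeholder values: replace them with your own endpoint and repository.
request = build_snapshot_request("https://localhost:9200", "my_repository", "pre-upgrade")
print(request.get_method(), request.full_url)

# Sending it would be urllib.request.urlopen(request), once credentials
# and the cluster's CA certificate have been configured.
```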

On the Elasticsearch side, it's very important to have at least 3 nodes (as I mentioned before), and each index must have at least 1 replica. With that, you can guarantee high availability during upgrades. You should also consider your OKE setup: if you have only one Kubernetes node, there's nothing we can do.
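One way to check the replica requirement before an upgrade is to look at the output of `GET /_settings` and list the indices that have no replicas. The helper below is only a sketch: the function name and the sample index names are invented for illustration, and it assumes the response has the usual `{"<index>": {"settings": {"index": {...}}}}` shape with setting values returned as strings.

```python
from typing import Dict, List


def indices_without_replicas(settings_response: Dict[str, dict]) -> List[str]:
    """Return the names of indices whose number_of_replicas is below 1."""
    missing = []
    for index_name, body in settings_response.items():
        index_settings = body["settings"]["index"]
        # Setting values come back as strings; the default is 1 replica.
        replicas = int(index_settings.get("number_of_replicas", 1))
        if replicas < 1:
            missing.append(index_name)
    return sorted(missing)


# Made-up sample in the shape of a GET /_settings response.
sample = {
    "logs-2022.05": {"settings": {"index": {"number_of_replicas": "0"}}},
    "metrics-2022.05": {"settings": {"index": {"number_of_replicas": "1"}}},
}
print(indices_without_replicas(sample))
```

Any index it reports would become unavailable while the node holding its only copy is being drained.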

Hello @framsouza,

During the node drain, we did not face any issue with the elastic-operator Pod. The operator Pod started running fine on the new node, as you mentioned.

However, while draining the node hosting one of the master nodes (quickstart-es-nodeset), the Pod went into the Pending state. When we described this Pod, we observed that none of the available nodes matched the Pod's volumeAffinity. After checking the ECK documentation, I deleted the PVC of that Pod. With this, the new Pod came up with a new PVC, but it was throwing MasterNotDiscoveredException.

Right now, I want to understand how I can resolve this MasterNotDiscoveredException and get the master nodes running.

Also, as you suggested, I am planning to upgrade the nodeset to 3 replicas. With the old cluster's master nodes down, will elastic-operator copy the data from the old PVs to the PVs of the new nodeset?

Thanks,
Aditya

The documentation you are referring to assumes you are running Elasticsearch in a resilient, high-availability configuration. It also applies only to local persistent volumes (it is not clear to me from your post whether you are using that kind of volume, but I will assume so for the moment).

Please see the Elasticsearch documentation I linked here for what constitutes an HA setup: Designing for resilience | Elasticsearch Guide [8.2] | Elastic

As @framsouza said, the data on the disk you deleted is probably gone, and you will have to rebuild your cluster from scratch. You can potentially salvage the remaining data on the node that is still running with the help of the elasticsearch-node tool, by re-bootstrapping the cluster as a single-node cluster.
