Elastic Cloud on Kubernetes not starting - advice needed on (probable) cause

Hi all,

I'm currently troubleshooting an Elasticsearch cluster that is not starting up. The cluster was deployed by others using a home-made Helm chart and the Elastic Cloud on Kubernetes (ECK) operator.

The Helm chart is based on the following YAML:

apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: {{ .Release.Name }}
  namespace: {{ .Release.Namespace }}
spec:
...
...
  [skipped_data]
...
...
  - name: default-hot
    count: 2
    config:
      node.attr.data: hot
      node.master: true
      node.data: true
      node.ingest: true
      node.ml: false
      xpack.ml.enabled: false
      node.store.allow_mmap: false
      cluster.remote.connect: false
      cluster.routing.allocation.enable: all
      cluster.routing.allocation.node_concurrent_incoming_recoveries: 4
      cluster.routing.allocation.node_concurrent_outgoing_recoveries: 4
      cluster.routing.allocation.node_initial_primaries_recoveries: 8
      cluster.routing.allocation.same_shard.host: false
      cluster.routing.rebalance.enable: all
      cluster.routing.allocation.allow_rebalance: indices_all_active
      cluster.routing.allocation.cluster_concurrent_rebalance: 4
      cluster.routing.allocation.balance.shard: 0.45f
      cluster.routing.allocation.balance.index: 0.55f
      cluster.routing.allocation.balance.threshold: 1.0f
      cluster.routing.allocation.disk.threshold_enabled: true
      cluster.routing.allocation.disk.watermark.low: 85%
      cluster.routing.allocation.disk.watermark.high: 90%
      cluster.routing.allocation.disk.watermark.flood_stage: 95%
      cluster.info.update.interval: 240s
      cluster.routing.allocation.disk.include_relocations: true

After a hard restart of all Pods, neither elastic-aocc-es-default-hot-0 nor elastic-aocc-es-default-hot-1 starts; both logs show that no master node can be found:

{"type": "server", "timestamp": "2020-07-23T03:50:38,486Z", "level": "WARN", "component": "o.e.c.c.ClusterFormationFailureHelper", "cluster.name": "elastic-aocc", "node.name": "elastic-aocc-es-default-hot-1", "message": "master not discovered yet, this node has not previously joined a bootstrapped (v7+) cluster, and [cluster.initial_master_nodes] is empty on this node

My questions:

  1. I can see that the YAML configuration makes NO mention of the cluster.initial_master_nodes property. Could this actually be the problem?

  2. If so, is it correct that all I need to do is add the property to the YAML and upgrade the Helm chart?

  3. Will there be data loss?

Thanks a lot :slight_smile:

ECK takes care of this setting; you should not set it yourself in the configuration.
That setting should only be set on the first cluster bootstrap and then removed from the configuration, which is what ECK does behind the scenes.
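
For what it's worth, in the ECK versions I have worked with the operator records the bootstrap state as an annotation on the Elasticsearch resource, roughly like the sketch below. The UUID value is a made-up placeholder and the exact annotation key may differ between ECK releases, so treat this as an illustration rather than something to put in your chart:

apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: elastic-aocc
  annotations:
    # Written by the operator once the cluster has bootstrapped;
    # its presence tells ECK not to set cluster.initial_master_nodes again.
    elasticsearch.k8s.elastic.co/cluster-uuid: "placeholder-cluster-uuid"
spec:
  ...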

In your situation, where all Pods were restarted, they should normally come back and reuse the same PersistentVolumes, with the same data, so the cluster recognizes its existing cluster state and can move on.

Based on "master not discovered yet, this node has not previously joined a bootstrapped (v7+) cluster", it looks like the Pods' data is missing (data loss); they were probably recreated with an empty volume.

Can you share more details about your PersistentVolumes / volumeClaimTemplates setup?
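
For reference, a typical volumeClaimTemplates block under nodeSets looks roughly like the sketch below; the storage class name and size are placeholders, not values taken from your chart. If the underlying StorageClass keeps released volumes (reclaimPolicy: Retain), the data also survives an accidental claim deletion:

  nodeSets:
  - name: default-hot
    count: 2
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data    # ECK expects the claim to be named elasticsearch-data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 100Gi          # placeholder size
        storageClassName: standard  # placeholder; use a class whose volumes survive Pod recreation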

OK Sebastien, thank you for your clarifications.
I simply didn't have time to investigate the causes further, nor the capacity to fully understand what was actually happening under the hood... so, in short, the problem was solved by bootstrapping a brand new cluster (via a clean release of the custom Helm chart).

This solution probably won't help others, but it worked for us. I think, however, that the situation was "poisoned" from the beginning and that there were no feasible alternatives.