Elastic Cloud on Kubernetes not starting - advice needed on (probable) cause

paoloyx · July 24, 2020, 3:40pm

Hi all,

I'm currently troubleshooting an Elasticsearch cluster that is not starting up. The cluster was deployed by others using a home-made Helm chart and Elastic Cloud on Kubernetes (ECK).

The whole Helm chart is based on the following YAML:

apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: {{ .Release.Name }}
  namespace: {{ .Release.Namespace }}
spec:
...
...
  [skipped_data]
...
...
  - name: default-hot
    count: 2
    config:
      node.attr.data: hot
      node.master: true
      node.data: true
      node.ingest: true
      node.ml: false
      xpack.ml.enabled: false
      node.store.allow_mmap: false
      cluster.remote.connect: false
      cluster.routing.allocation.enable: all
      cluster.routing.allocation.node_concurrent_incoming_recoveries: 4
      cluster.routing.allocation.node_concurrent_outgoing_recoveries: 4
      cluster.routing.allocation.node_initial_primaries_recoveries: 8
      cluster.routing.allocation.same_shard.host: false
      cluster.routing.rebalance.enable: all
      cluster.routing.allocation.allow_rebalance: indices_all_active
      cluster.routing.allocation.cluster_concurrent_rebalance: 4
      cluster.routing.allocation.balance.shard: 0.45f
      cluster.routing.allocation.balance.index: 0.55f
      cluster.routing.allocation.balance.threshold: 1.0f
      cluster.routing.allocation.disk.threshold_enabled: true
      cluster.routing.allocation.disk.watermark.low: 85%
      cluster.routing.allocation.disk.watermark.high: 90%
      cluster.routing.allocation.disk.watermark.flood_stage: 95%
      cluster.info.update.interval: 240s
      cluster.routing.allocation.disk.include_relocations: true

After a hard restart of all Pods, neither elastic-aocc-es-default-hot-0 nor elastic-aocc-es-default-hot-1 starts. The logs of both Pods show that no master node can be found:

{"type": "server", "timestamp": "2020-07-23T03:50:38,486Z", "level": "WARN", "component": "o.e.c.c.ClusterFormationFailureHelper", "cluster.name": "elastic-aocc", "node.name": "elastic-aocc-es-default-hot-1", "message": "master not discovered yet, this node has not previously joined a bootstrapped (v7+) cluster, and [cluster.initial_master_nodes] is empty on this node

My questions:

I can see that the YAML configuration makes NO mention of the cluster.initial_master_nodes property. Could this actually be the problem?
If so, is all I need to do add the property to the YAML and upgrade the Helm release?
Will there be data loss?

Thanks a lot

sebgl · July 28, 2020, 9:46am

ECK takes care of this setting, you should not set it yourself in the configuration.
That setting should only be set on the first cluster bootstrap, then be removed from the configuration. This is what ECK is doing behind the scenes.

In your situation, where all Pods were restarted, they should normally come back and reuse the same PersistentVolumes, with the same data. The cluster then recognizes its existing cluster state data and can move on.

Based on "master not discovered yet, this node has not previously joined a bootstrapped (v7+) cluster", it looks like the Pods' data is missing (data loss): they were probably recreated with empty volumes.

Can you share more details about your PersistentVolumes / volumeClaimTemplates setup?
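
For reference, here is a minimal sketch of a nodeSet that requests persistent storage. It assumes the ECK convention of a claim named elasticsearch-data; the storage size and storageClassName are placeholders and are not taken from this thread.

```yaml
spec:
  nodeSets:
  - name: default-hot
    count: 2
    volumeClaimTemplates:
    - metadata:
        # ECK expects the data volume claim to use this name
        name: elasticsearch-data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 5Gi              # placeholder size (assumption)
        storageClassName: standard    # placeholder class (assumption)
```

If the chart instead mounts elasticsearch-data as an emptyDir volume through the podTemplate, the data does not survive Pod deletion, which would be consistent with the log message above.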

paoloyx · July 30, 2020, 6:57am

Ok Sebastien, thank you for your clarifications.
I simply didn't have time to investigate the causes further, nor the capacity to fully understand what was actually happening under the hood. In short, the problem was solved by bootstrapping a brand new cluster (via a clean release of the custom Helm chart).

This solution probably won't help others, but it worked for us. I do think, however, that the situation was "poisoned" from the beginning and that there were no feasible alternatives.
