Cluster never recovers on bare-metal cloud-on-k8s instance

I've tried installing it several times, first with the 0.8.0 operator and then 0.8.1; the issue is always the same.

Whenever we reboot one or a few machines, the cluster never recovers: all masters become unavailable, and it keeps trying to recover but fails indefinitely. Below are snippets from kubetail'ing all the pods. At this point I'm about ready to give up on the operator; it looks like we don't have enough technical knowledge to run this in production, and should issues occur, we'd have no way to recover:

Not sure if these events are of any help:

LAST SEEN   TYPE      REASON                   OBJECT                                             MESSAGE
1s          Warning   BackOff                  pod/kibana-kibana-c98867586-4594g                  Back-off restarting failed container
1s          Warning   BackOff                  pod/kibana-kibana-c98867586-4594g                  Back-off restarting failed container
1s          Warning   Unhealthy                pod/kibana-kibana-c98867586-4594g                  Readiness probe failed: HTTP probe failed with statuscode: 503
114s        Warning   FailedToUpdateEndpoint   endpoints/elastic-es-discovery                     Failed to update endpoint ops/elastic-es-discovery: Operation cannot be fulfilled on endpoints "elastic-es-discovery": the object has been modified; please apply your changes to the latest version and try again
114s        Warning   FailedToUpdateEndpoint   endpoints/elastic-es                               Failed to update endpoint ops/elastic-es: Operation cannot be fulfilled on endpoints "elastic-es": the object has been modified; please apply your changes to the latest version and try again
1s          Normal    Killing                  pod/elastic-es-qqtm956plr                          Stopping container elasticsearch
0s          Normal    Killing                  pod/elastic-es-k2bcb29q68                          Stopping container elasticsearch
115s        Warning   FailedToUpdateEndpoint   endpoints/elastic-es                               Failed to update endpoint ops/elastic-es: Operation cannot be fulfilled on endpoints "elastic-es": the object has been modified; please apply your changes to the latest version and try again
0s          Normal    Killing                  pod/elastic-es-pzthtpb4mh                          Stopping container elasticsearch
0s          Normal    Killing                  pod/elastic-es-9hdbl2tzj7                          Stopping container elasticsearch
0s          Normal    Killing                  pod/elastic-es-4vkcnm5kxv                          Stopping container elasticsearch
0s          Warning   Unhealthy                pod/elastic-es-9hdbl2tzj7                          Readiness probe failed:
0s          Warning   Unhealthy                pod/elastic-es-pzthtpb4mh                          Readiness probe failed:
0s          Warning   Unhealthy                pod/elastic-es-9hdbl2tzj7                          Readiness probe failed:
0s          Warning   Unhealthy                pod/elastic-es-9hdbl2tzj7                          Readiness probe failed:
1s          Warning   Unhealthy                pod/kibana-kibana-c98867586-4594g                  Readiness probe failed: HTTP probe failed with statuscode: 503

Or these logs:
[elastic-es-655t98ckzr cert-initializer] 2019-07-10T16:46:32.499Z INFO certificate-initializer No private key found on disk, will create one {"reason": "open /mnt/elastic/private-key/node.key: no such file or directory"}
[elastic-es-655t98ckzr prepare-fs] at org.elasticsearch.cli.Command.main(Command.java:90)
[elastic-es-655t98ckzr prepare-fs] at org.elasticsearch.plugins.PluginCli.main(PluginCli.java:47)
[elastic-es-655t98ckzr cert-initializer] 2019-07-10T16:46:32.499Z INFO certificate-initializer Creating a private key on disk
[elastic-es-655t98ckzr prepare-fs] Installed plugins:
[elastic-es-655t98ckzr cert-initializer] 2019-07-10T16:46:32.815Z INFO certificate-initializer Generating a CSR from the private key
[elastic-es-655t98ckzr prepare-fs] Plugins installation duration: 52 sec.
[elastic-es-655t98ckzr cert-initializer] 2019-07-10T16:46:32.818Z INFO certificate-initializer Serving CSR over HTTP {"port": 8001}
[elastic-es-655t98ckzr cert-initializer] 2019-07-10T16:46:32.818Z INFO certificate-initializer Watching filesystem for cert update

I also see errors like this in the elastic-operator logs:

{"level":"info","ts":1562781074.3130052,"logger":"elasticsearch-controller","msg":"End reconcile iteration","iteration":908,"took":10.00281855,"request":"ops/elastic"}
{"level":"error","ts":1562781074.3130827,"logger":"kubebuilder.controller","msg":"Reconciler error","controller":"elasticsearch-controller","request":"ops/elastic","error":"Get http://10.42.7.49:8001/csr: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)","errorCauses":[{"error":"Get http://10.42.7.49:8001/csr: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"}],"stacktrace":"github.com/elastic/cloud-on-k8s/operators/vendor/github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/elastic/cloud-on-k8s/operators/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/elastic/cloud-on-k8s/operators/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/elastic/cloud-on-k8s/operators/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:217\ngithub.com/elastic/cloud-on-k8s/operators/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1\n\t/go/src/github.com/elastic/cloud-on-k8s/operators/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158\ngithub.com/elastic/cloud-on-k8s/operators/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/github.com/elastic/cloud-on-k8s/operators/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\ngithub.com/elastic/cloud-on-k8s/operators/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/elastic/cloud-on-k8s/operators/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134\ngithub.com/elastic/cloud-on-k8s/operators/vendor/k8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/elastic/cloud-on-k8s/operators/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}
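
The error above shows the operator timing out while fetching a CSR from a pod's cert-initializer on port 8001. A quick in-cluster reachability check would look like this (a sketch; the IP and port are copied from the error, the namespace is assumed to be ops):

$ kubectl run csr-check -n ops --rm -it --restart=Never --image=curlimages/curl -- curl -m 5 http://10.42.7.49:8001/csr

If that also times out, it points at pod-to-pod networking rather than the operator itself.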

And even if I try to re-create the cluster by deleting the Elasticsearch custom resource and re-creating it (see the commands after the listing below), everything gets stuck in this state:

$ kubectl get po -w
NAME                             READY   STATUS     RESTARTS   AGE
elastic-es-7thrknt9hg            0/1     Init:3/4   0          4m36s
elastic-es-87kzhhhmkh            0/1     Init:3/4   0          4m36s
elastic-es-8jrvg8wcpn            0/1     Init:3/4   0          4m33s
elastic-es-8m492rzpqj            0/1     Init:3/4   0          4m36s
elastic-es-8wql4rg7sp            0/1     Init:3/4   0          4m35s
elastic-es-gm2rbcgwv2            0/1     Init:3/4   0          4m33s
elastic-es-gzgnd7f8bz            0/1     Init:3/4   0          4m36s
elastic-es-km4rcpdwqp            0/1     Init:3/4   0          4m33s
elastic-es-lqkdpgzcj7            0/1     Init:3/4   0          4m31s
elastic-es-vmkgbs8mdz            0/1     Init:3/4   0          4m31s
elastic-es-wqsq4sfc74            0/1     Init:3/4   0          4m36s
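
For reference, the re-creation steps were roughly the following (a sketch; the resource name and namespace come from the operator logs above, the manifest file name is assumed):

$ kubectl delete elasticsearch elastic -n ops
$ kubectl apply -n ops -f elastic.yaml

Init:3/4 means each pod has completed three of its four init containers and is stuck on the fourth; kubectl describe shows which one:

$ kubectl describe pod elastic-es-7thrknt9hg -n ops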

Hey @Alexei_Smirnov,

Can you share your Elasticsearch cluster YAML specification?
Are you using PersistentVolumes? Which storage class implementation?
Nodes should normally come back to life by reusing their persistent volumes.
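
A quick way to verify is to check whether the PersistentVolumeClaims survived the restart and are still bound (namespace taken from the events above):

$ kubectl get pvc -n ops
$ kubectl get pv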

Based on the logs, it looks like some pods are waiting for certificates that should be provided by the operator, so there could be a bug here. In the upcoming 0.9 release we changed the way certificates are delivered to the pods, which may well fix this issue.
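
To see whether a pod is blocked on that step, the cert-initializer logs of one of the affected pods are worth a look (pod and container names taken from your logs above):

$ kubectl logs elastic-es-655t98ckzr -c cert-initializer -n ops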

Yes, gladly. We use rancher/local-path-provisioner, and here's the Elasticsearch resource:

apiVersion: elasticsearch.k8s.elastic.co/v1alpha1
kind: Elasticsearch
metadata:
  name: elastic
  labels:
    kubernetes.io/cluster-service: "true"
    kubernetes.io/name: "Elasticsearch"
spec:
  version: "7.2.0"
  nodes:
    # dedicated data nodes
  - config:
      node.data: true
      node.master: false
      node.ingest: false
      node.ml: false
#      node.attr.attr_name: attr_value
      xpack.monitoring.enabled: true
      xpack.monitoring.collection.enabled: true
    podTemplate:
#      metadata:
#        labels:
#          master: "true"
      spec:
        containers:
        - name: elasticsearch
          resources:
            limits:
              memory: 16Gi
              cpu: 1
    nodeCount: 3
    ## request 500Gi of persistent data storage for pods in this topology element
    volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 500Gi
        storageClassName: local-path # or, e.g., gcePersistentDisk
    # dedicated ML nodes
  - config:
      node.data: false
      node.master: false
      node.ingest: false
      node.ml: true
#      node.attr.attr_name: attr_value
      xpack.monitoring.enabled: true
      xpack.monitoring.collection.enabled: true
    podTemplate:
#      metadata:
#        labels:
#          master: "true"
      spec:
        containers:
        - name: elasticsearch
          resources:
            limits:
              memory: 16Gi
              cpu: 1
    nodeCount: 3
    ## request 500Gi of persistent data storage for pods in this topology element
    volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 500Gi
        storageClassName: local-path

    # dedicated master nodes
  - config:
      node.master: true
      node.data: false
      node.ingest: false
      node.ml: false
      xpack.ml.enabled: true
      xpack.monitoring.enabled: true
      xpack.monitoring.collection.enabled: true
      cluster.remote.connect: true
#      node.attr.attr_name: attr_value
    podTemplate:
#      metadata:
#        labels:
#          data: "true"
      spec:
        containers:
          - name: elasticsearch
            resources:
              limits:
                memory: 4Gi
                cpu: 1
    nodeCount: 3

    # dedicated ingest nodes
  - config:
      node.ingest: true
      node.master: false
      node.data: false
      node.ml: false
#      node.attr.attr_name: attr_value
    podTemplate:
#      metadata:
#        labels:
#          client: "true"
      spec:
        containers:
          - name: elasticsearch
            resources:
              limits:
                memory: 8Gi
                cpu: 1
    nodeCount: 2



  ## Inject secure settings into Elasticsearch nodes from a k8s secret reference
  # secureSettings:
  #   secretName: "ref-to-secret"
  ## Expose the HTTP service through a MetalLB LoadBalancer
  http:
    service:
      metadata:
        annotations:
          metallb.universe.tf/address-pool: local
          metallb.universe.tf/allow-shared-ip: infra-db
      spec:
        type: LoadBalancer
        loadBalancerIP: "10.255.42.67"
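
Once the pods are ready, the MetalLB address assignment can be checked on the HTTP service (service name taken from the events above):

$ kubectl get svc elastic-es -n ops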

I tried installing 0.9.0 and it appears to have the same issue. Could you elaborate on ways to troubleshoot? Basically, a fresh install on 0.9.0 doesn't allow a single node to re-join after that node is restarted.

Unlike the demo docs, we use rancher/local-path-provisioner as the local-disk storage class. I'm not sure whether there's any conflicting functionality there, but it appears to behave pretty much the same.
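
For what it's worth, local-path pins each PersistentVolume to the node it was created on via node affinity, so a pod reusing such a volume can only be rescheduled onto that same node. The affinity can be inspected like this (the PV name is a placeholder):

$ kubectl get pv
$ kubectl describe pv <pv-name> | grep -A 5 'Node Affinity'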

Thank you