Elasticsearch won't even try to launch

I'm trying to deploy Elasticsearch into a new EKS 1.18 cluster. For some reason no Elasticsearch instances will even attempt to launch, i.e. no pods even attempt to start. The operator just sits in a reconciliation loop and nothing else happens in the cluster.

I can launch Kibana instances, which promptly start crashing because they can't talk to Elasticsearch. So it doesn't seem to be something fundamentally broken with the operator config.

To rule out my own Elasticsearch config tweaks, I'm using the manifests straight from the ECK 1.2 quickstart.

Has anyone else run into this?

Steps to replicate:

  1. Spin up EKS 1.18 cluster

  2. Deploy the operator
    kubectl apply -f https://download.elastic.co/downloads/eck/1.2.1/all-in-one.yaml

  3. Create the ES instance

cat <<EOF | kubectl apply -f -
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: quickstart
spec:
  version: 7.9.3
  nodeSets:
  - name: default
    count: 1
    config:
      node.master: true
      node.data: true
      node.ingest: true
      node.store.allow_mmap: false
EOF
  4. Watch nothing happen
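For anyone reproducing this, a quick way to confirm that nothing was created (the `elastic-system` namespace is an assumption based on my setup; the quickstart manifest applies to whatever namespace your context uses):

```shell
# List the Elasticsearch resource and anything it should have spawned.
# Expect to see the Elasticsearch object, but no pods for it.
kubectl get elasticsearch,statefulsets,pods -n elastic-system
```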

The only error from the logs I can find is:

{"log.level":"info","@timestamp":"2020-10-26T15:59:58.718Z","log.logger":"statefulset","message":"Pod validation skipped","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","namespace":"elastic-system","es_name":"quickstart","error":"admission webhook \"iam-for-pods.amazonaws.com\" does not support dry run"}

This suggests the Pod definition might be invalid, but I can't tell in what way. With debug logging enabled I get a bit more:

{"log.level":"debug","@timestamp":"2020-10-26T16:19:47.930Z","log.logger":"controller-runtime.webhook.webhooks","message":"wrote response","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","webhook":"/validate-elasticsearch-k8s-elastic-co-v1-elasticsearch","UID":"92ec91bd-d3e1-4693-8200-ce5304094895","allowed":true,"result":{},"resultError":"got runtime.Object without object metadata: &Status{ListMeta:ListMeta{SelfLink:,ResourceVersion:,Continue:,RemainingItemCount:nil,},Status:,Message:,Reason:,Details:nil,Code:200,}"}
{"log.level":"debug","@timestamp":"2020-10-26T16:19:47.947Z","log.logger":"controller-runtime.webhook.webhooks","message":"wrote response","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","webhook":"/validate-elasticsearch-k8s-elastic-co-v1-elasticsearch","UID":"69edd3b2-e8f3-4462-9a07-a9f5519a117b","allowed":true,"result":{},"resultError":"got runtime.Object without object metadata: &Status{ListMeta:ListMeta{SelfLink:,ResourceVersion:,Continue:,RemainingItemCount:nil,},Status:,Message:,Reason:,Details:nil,Code:200,}"}
{"log.level":"debug","@timestamp":"2020-10-26T16:19:49.260Z","log.logger":"controller-runtime.webhook.webhooks","message":"wrote response","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","webhook":"/validate-elasticsearch-k8s-elastic-co-v1-elasticsearch","UID":"fc841579-e9f0-4708-96f1-1882ea2be364","allowed":true,"result":{},"resultError":"got runtime.Object without object metadata: &Status{ListMeta:ListMeta{SelfLink:,ResourceVersion:,Continue:,RemainingItemCount:nil,},Status:,Message:,Reason:,Details:nil,Code:200,}"}
{"log.level":"info","@timestamp":"2020-10-26T16:19:49.288Z","log.logger":"statefulset","message":"Pod validation skipped","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","namespace":"elastic-system","es_name":"quickstart","error":"admission webhook \"iam-for-pods.amazonaws.com\" does not support dry run"}

So I tried disabling the webhook (still with debug logging on). Back to the original error:

{"log.level":"info","@timestamp":"2020-10-26T16:22:08.330Z","log.logger":"statefulset","message":"Pod validation skipped","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","namespace":"elastic-system","es_name":"quickstart","error":"admission webhook \"iam-for-pods.amazonaws.com\" does not support dry run"}

I can get rid of the "got runtime.Object without object metadata" error from above if I add metadata to the podTemplate:

apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: quickstart
spec:
  version: 7.3.1
  nodeSets:
    - name: default
      count: 3
      config:
        node.master: true
        node.data: true
        node.ingest: true
        node.store.allow_mmap: false
      podTemplate:
        metadata:
          labels:
            name: elasticsearch-logging
          annotations:
            co.elastic.logs/module: elasticsearch

But this error remains:

{"log.level":"info","@timestamp":"2020-10-26T16:22:08.330Z","log.logger":"statefulset","message":"Pod validation skipped","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","namespace":"elastic-system","es_name":"quickstart","error":"admission webhook \"iam-for-pods.amazonaws.com\" does not support dry run"}

And Elasticsearch still won't even try to start.

This just repeats:

{"log.level":"debug","@timestamp":"2020-10-26T16:42:58.567Z","log.logger":"observer","message":"Retrieving cluster state","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","es_name":"quickstart","namespace":"elastic-system"}
{"log.level":"debug","@timestamp":"2020-10-26T16:42:59.591Z","log.logger":"observer","message":"Unable to retrieve cluster health","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","error":"Get \"https://quickstart-es-http.elastic-system.svc:9200/_cluster/health\": dial tcp 172.20.58.235:9200: connect: connection refused","namespace":"elastic-system","es_name":"quickstart"}
{"log.level":"info","@timestamp":"2020-10-26T16:42:59.610Z","log.logger":"elasticsearch-controller","message":"Starting reconciliation run","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","iteration":9,"namespace":"elastic-system","es_name":"quickstart"}
{"log.level":"debug","@timestamp":"2020-10-26T16:42:59.610Z","log.logger":"es-validation","message":"validate create","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","name":"quickstart"}
{"log.level":"debug","@timestamp":"2020-10-26T16:42:59.897Z","log.logger":"driver","message":"StatefulSets observedGeneration is not reconciled yet, re-queueing","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","namespace":"elastic-system","es_name":"quickstart"}
{"log.level":"info","@timestamp":"2020-10-26T16:42:59.897Z","log.logger":"elasticsearch-controller","message":"Ending reconciliation run","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","iteration":9,"namespace":"elastic-system","es_name":"quickstart","took":0.286828326}

Hi @larslevie, thanks for your question.

The log about the admission webhook is not related - we do a best-effort validation of the Pod spec using the dry-run API if it's available. If it isn't, we log the message you see, but we don't stop the reconciliation.

I tried the spec you pasted and it works for me. To debug further, could you take a look at the StatefulSet that ECK creates for this ES? You can do that by running kubectl describe statefulsets.apps quickstart-es-default. If it exists (it should), its events can tell you more about why Pods are not being created.

Haha, thank you! I got so wrapped up in it "not working" that it didn't even occur to me to check the StatefulSet. :man_facepalming:t3:

Turns out something completely unrelated to the operator was at fault (of course!). This new EKS cluster apparently ships with a default storage class, and I bootstrap my clusters with a default storage class of my own. So Kubernetes had two default storage classes, which prevented the StatefulSet from creating its pods.
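For anyone hitting the same wall: you can spot the duplicate-default condition from `kubectl get storageclass` output, where every default class is marked "(default)". A minimal sketch (the class names below are hypothetical; the simulated output stands in for a real cluster):

```shell
# Simulated `kubectl get storageclass` output for illustration; against a real
# cluster you would pipe the actual command instead.
sc_output='gp2 (default)          kubernetes.io/aws-ebs
my-default (default)   kubernetes.io/aws-ebs'

# Count how many classes are flagged as default -- more than 1 is the problem.
defaults=$(printf '%s\n' "$sc_output" | grep -c '(default)')
echo "default storage classes: $defaults"

# The fix is to demote one of them, e.g. (run against the real cluster):
# kubectl patch storageclass gp2 -p \
#   '{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "false"}}}'
```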

Thanks again for giving me the nudge I needed.

Great to hear! And thanks for the update.