ECK 1.0.0-beta1: pod goes into CrashLoopBackOff during node startup

Hi,

I have a 6-node ECK cluster. Whenever a node reboots, the Elasticsearch 7.5 pod does not start; it goes into CrashLoopBackOff. If I delete the pod, it then starts properly. I am unsure why the pod goes into a crash loop.

# kgpw -w
NAME                         READY   STATUS                  RESTARTS   AGE   
elastic-operator-0           1/1     Running                 8          2d22h 
elk-prd-es-default-0         1/1     Running                 0          22h   
elk-prd-es-default-1         1/1     Running                 0          2d16h 
elk-prd-es-default-2         1/1     Running                 0          2d16h 
elk-prd-es-default-3         1/1     Running                 0          17s   
elk-prd-es-default-4         0/1     Init:CrashLoopBackOff   9          22h      
elk-prd-es-default-4         0/1     Init:1/3                10         22h   
elk-prd-es-default-4         0/1     Init:Error              10         22h   
elk-prd-es-default-4         0/1     Init:CrashLoopBackOff   10         22h   

Here are the events:

Events:
  Type     Reason          Age                   From                Message
  ----     ------          ----                  ----                -------
  Warning  BackOff         21m (x5767 over 21h)  kubelet, ecknode04  Back-off restarting failed container
  Normal   SandboxChanged  17m                   kubelet, ecknode04  Pod sandbox changed, it will be killed and re-created.
  Normal   Pulled          17m                   kubelet, ecknode04  Container image "docker.elastic.co/elasticsearch/elasticsearch:7.5.0" already present on machine
  Normal   Created         17m                   kubelet, ecknode04  Created container elastic-internal-init-filesystem
  Normal   Started         17m                   kubelet, ecknode04  Started container elastic-internal-init-filesystem
  Normal   Pulled          17m (x4 over 17m)     kubelet, ecknode04  Container image "docker.elastic.co/elasticsearch/elasticsearch:7.5.0" already present on machine
  Normal   Created         17m (x4 over 17m)     kubelet, ecknode04  Created container elastic-internal-init-keystore
  Normal   Started         17m (x4 over 17m)     kubelet, ecknode04  Started container elastic-internal-init-keystore
  Warning  BackOff         2m50s (x77 over 17m)  kubelet, ecknode04  Back-off restarting failed container

Any help resolving this issue would be appreciated.

Hi @sfgroups1, if you look at the logs of the pod (docs here: https://www.elastic.co/guide/en/cloud-on-k8s/current/k8s-troubleshooting.html#k8s-get-elasticsearch-logs) you may get a better idea of why it is crashing.

The log messages do not show any error; they say the pod started. Something else is preventing the pod from reaching ready status.

 kgp |grep elk-prd-es-default-4
elk-prd-es-default-4         0/1     Init:CrashLoopBackOff   80         28h

k logs elk-prd-es-default-4 |tail -1
{"type": "server", "timestamp": "2019-12-08T17:03:29,608Z", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "elk-prd", "node.name": "elk-prd-es-default-4", "message": "started", "cluster.uuid": "3d9MXxV2S4-226M5VR7cjA", "node.id": "LEnpjkmaSbm6ROM2XQIDSg"  }

The node log shows this error message:

Error syncing pod 16eff8e8-6b1a-4de7-b031-6af8d78ddb12 ("elk-prd-es-default-4_elastic-system(16eff8e8-6b1a-4de7-b031-6af8d78ddb12)"), skipping: failed to "StartContainer" for "elastic-internal-init-keystore" with CrashLoopBackOff: "back-off 5m0s restarting failed container=elastic-internal-init-keystore pod=elk-prd-es-default-4_elastic-system(16eff8e8-6b1a-4de7-b031-6af8d78ddb12)"

Can you post your Elasticsearch yaml manifest?
Also, can you give us the output of the init container logs?

kubectl logs elk-prd-es-default-4 -c elastic-internal-init-keystore

Here is the output; I am unsure why it is looking for a terminal.

# kubectl logs elk-prd-es-default-0 -c elastic-internal-init-keystore
+ echo 'Initializing keystore.'
+ /usr/share/elasticsearch/bin/elasticsearch-keystore create
Initializing keystore.
Exception in thread "main" java.lang.IllegalStateException: unable to read from standard input; is standard input open and a tty attached?
        at org.elasticsearch.cli.Terminal$SystemTerminal.readText(Terminal.java:207)
        at org.elasticsearch.cli.Terminal.promptYesNo(Terminal.java:140)
        at org.elasticsearch.common.settings.CreateKeyStoreCommand.execute(CreateKeyStoreCommand.java:43)
        at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:86)
        at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:125)
        at org.elasticsearch.cli.MultiCommand.execute(MultiCommand.java:77)
        at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:125)
        at org.elasticsearch.cli.Command.main(Command.java:90)
        at org.elasticsearch.common.settings.KeyStoreCli.main(KeyStoreCli.java:41)
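For context, the stack trace above (Terminal.promptYesNo) shows that elasticsearch-keystore is trying to prompt for confirmation on standard input; inside an init container there is no terminal attached and stdin is closed, so the read fails immediately. A minimal shell illustration of that failure mode (the function name is made up for the demo, not part of any Elastic script):

```shell
# Demonstration only: a prompt that reads stdin fails when stdin is closed,
# which is the situation inside an init container with no tty attached.
prompt_yes_no() {
  local answer
  # Redirecting from /dev/null simulates the closed stdin of an init container.
  if read -r answer < /dev/null; then
    echo "answered: $answer"
  else
    echo "no tty/stdin available" >&2
    return 1
  fi
}
```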

This looks like a wrong keystore init container command.
Can you please post your elasticsearch resource yaml manifest?

Here is the yaml file. I only have this issue during a server restart; if I delete the crashlooping pod, it then starts properly.

apiVersion: elasticsearch.k8s.elastic.co/v1beta1
kind: Elasticsearch
metadata:
  name: elk-prd
spec:
  version: 7.5.0  
  nodeSets:
    - name: default
      count: 5
      config:
        node.master: true
        node.data: true
        node.ingest: true
        node.store.allow_mmap: false 
      podDisruptionBudget:
        spec:
          maxUnavailable: 2
          minAvailable: 3
          selector:
            matchLabels:
              elasticsearch.k8s.elastic.co/cluster-name: elk-prd
      volumeClaimTemplates:
        - metadata:
            name: elasticsearch-data
          spec:
            accessModes:
            - ReadWriteOnce
            resources:
              requests:
                storage: 490Gi
            storageClassName: local-storage

Where are we with this issue?
I am seeing exactly the same one in my ECK cluster, v1.1.2.

We run a Kubernetes cluster at AWS that is suspended every night to save money. Before we added the secureSettings section, the elasticsearch.yaml was working fine: the cluster came up every morning without any issue. However, this morning the cluster got stuck in CrashLoopBackOff. We started using secureSettings yesterday to add some gcs-credentials to the cluster. Unfortunately, this led to the issue listed below.

Snippet of the elasticsearch.yaml:

apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: elastic
spec:
  version: {{ $.Chart.AppVersion }}
  secureSettings:
  - secretName: elastic-es-gcs-credentials
  nodeSets:
  - name: master
    count: 3
    config:
      node.master: true
      node.data: false
      node.ingest: false

Output of kubectl describe for the failing pod:

  elastic-internal-init-keystore:
    Container ID:  docker://7eb7d89d6150d2a5983a9ea9461e7cea9c541a1443c181e1dbf96fa7f96df867
    Image:         docker.elastic.co/elasticsearch/elasticsearch:7.8.0
    Image ID:      docker-pullable://docker.elastic.co/elasticsearch/elasticsearch@sha256:161bc8c7054c622b057324618a2e8bc49ae703e64901b141a16d9c8bdd3b82f9
    Port:          <none>
    Host Port:     <none>
    Command:
      /usr/bin/env
      bash
      -c
      #!/usr/bin/env bash
      
      set -eux
      
      echo "Initializing keystore."
      
      # create a keystore in the default data path
      /usr/share/elasticsearch/bin/elasticsearch-keystore create
      
      # add all existing secret entries into it
      for filename in  /mnt/elastic-internal/secure-settings/*; do
        [[ -e "$filename" ]] || continue # glob does not match
        key=$(basename "$filename")
        echo "Adding "$key" to the keystore."
        /usr/share/elasticsearch/bin/elasticsearch-keystore add-file "$key" "$filename"
      done
      
      echo "Keystore initialization successful."
      
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Thu, 16 Jul 2020 07:31:32 +0200
      Finished:     Thu, 16 Jul 2020 07:31:33 +0200
    Ready:          False
    Restart Count:  14

Logs of the failing init container show the following:

[raulgs@raulgs-xm1 elastic]$ klogs -f pod/elastic-es-data-0 elastic-internal-init-keystore
+ echo 'Initializing keystore.'
+ /usr/share/elasticsearch/bin/elasticsearch-keystore create
Initializing keystore.
Exception in thread "main" java.lang.IllegalStateException: unable to read from standard input; is standard input open and a tty attached?
        at org.elasticsearch.cli.Terminal$SystemTerminal.readText(Terminal.java:273)
        at org.elasticsearch.cli.Terminal.promptYesNo(Terminal.java:152)
        at org.elasticsearch.common.settings.CreateKeyStoreCommand.execute(CreateKeyStoreCommand.java:51)
        at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:86)
        at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:127)
        at org.elasticsearch.cli.MultiCommand.execute(MultiCommand.java:91)
        at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:127)
        at org.elasticsearch.cli.Command.main(Command.java:90)
        at org.elasticsearch.common.settings.KeyStoreCli.main(KeyStoreCli.java:43)

Deleting the pod so that it gets re-deployed by k8s fixes the issue.
However, this is only a workaround.
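Until a proper fix lands, the workaround of deleting the crashlooping pods can at least be scripted so it does not have to be done by hand after every nightly resume. A hypothetical helper (the function name and the elastic-system namespace are assumptions, not part of ECK) that filters kubectl get pods output down to the affected pod names:

```shell
# Hypothetical helper: read `kubectl get pods --no-headers` output on stdin
# and print only the names of pods whose STATUS contains CrashLoopBackOff.
crashlooping_pods() {
  awk '$3 ~ /CrashLoopBackOff/ {print $1}'
}

# Intended use against a live cluster (manual workaround only); the
# StatefulSet controller recreates the deleted pods:
#   kubectl get pods -n elastic-system --no-headers \
#     | crashlooping_pods \
#     | xargs -r kubectl delete pod -n elastic-system
```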

After pod deletion and re-deployment, the log of the keystore init container looks like this:

[raulgs@raulgs-xm1 elastic]$ klogs -f pod/elastic-es-data-0 elastic-internal-init-keystore
+ echo 'Initializing keystore.'
+ /usr/share/elasticsearch/bin/elasticsearch-keystore create
Initializing keystore.
Created elasticsearch keystore in /usr/share/elasticsearch/config/elasticsearch.keystore
+ for filename in '/mnt/elastic-internal/secure-settings/*'
+ [[ -e /mnt/elastic-internal/secure-settings/gcs.client.default.credentials_file ]]
++ basename /mnt/elastic-internal/secure-settings/gcs.client.default.credentials_file
+ key=gcs.client.default.credentials_file
+ echo 'Adding gcs.client.default.credentials_file to the keystore.'
+ /usr/share/elasticsearch/bin/elasticsearch-keystore add-file gcs.client.default.credentials_file /mnt/elastic-internal/secure-settings/gcs.client.default.credentials_file
Adding gcs.client.default.credentials_file to the keystore.
+ echo 'Keystore initialization successful.'
Keystore initialization successful.

I suspect this is the underlying cause of the issue you are seeing. When you suspend the cluster and resume it in the morning, the init containers for the existing pods run again. The init-keystore init container script does not expect repeated runs, and the elasticsearch-keystore command asks for user confirmation when it encounters an existing keystore in the config directory, which causes the error you are seeing.
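One way to make such an init step tolerate repeated runs is to discard any keystore left over from a previous run before creating a new one. The following is only a sketch of that idea with stand-in names (the KEYSTORE path and the create_keystore function are placeholders for the demo), not the actual ECK fix:

```shell
set -eu

# Stand-in path; the real script works in the Elasticsearch config directory.
KEYSTORE="${TMPDIR:-/tmp}/demo-elasticsearch.keystore"

create_keystore() {
  # Stand-in for: /usr/share/elasticsearch/bin/elasticsearch-keystore create
  touch "$KEYSTORE"
}

# Remove any keystore a previous run of this init container left behind, so
# the create step never has to prompt about overwriting an existing file.
rm -f "$KEYSTORE"
create_keystore
echo "keystore ready at $KEYSTORE"
```

Because the stale file is removed up front, the script behaves the same on the first run and on every re-run after a node restart.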

This is a known issue (see https://github.com/elastic/cloud-on-k8s/issues/3294), and a fix will be included in ECK 1.2.


Yeah, that looks pretty much like it. Do you know when we can expect it to be released?

Or is there a workaround that I could use in the meantime?

ECK 1.2 is available as of today: https://www.elastic.co/guide/en/cloud-on-k8s/1.2/

I already saw it yesterday and upgraded directly.
I can confirm that the issue with re-initialization of the keystore is fixed.

Thanks a lot, guys.
