ECK Fleet startup failure on ECK on Cloud

vaibhavsw · August 17, 2021, 2:13pm

I followed the latest documentation to start a Fleet Server on Kubernetes, but both Elastic Agent and Fleet Server are throwing the error below.

cp: cannot stat '/mnt/elastic-internal/elasticsearch-association/vulcan/elasticsearch/certs/ca.crt': No such file or directory

Role & ClusterRoleBinding

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: fleet-server
rules:
- apiGroups: [""]
  resources:
  - pods
  verbs:
  - get
  - watch
  - list
- apiGroups: ["coordination.k8s.io"]
  resources:
  - leases
  verbs:
  - get
  - create
  - update
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: fleet-server
  namespace: vulcan
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: fleet-server
subjects:
- kind: ServiceAccount
  name: fleet-server
  namespace: vulcan
roleRef:
  kind: ClusterRole
  name: fleet-server
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: elastic-agent
rules:
- apiGroups: [""]
  resources:
  - pods
  - nodes
  - namespaces
  - events
  - services
  - configmaps
  verbs:
  - get
  - watch
  - list
- apiGroups: ["coordination.k8s.io"]
  resources:
  - leases
  verbs:
  - get
  - create
  - update
- nonResourceURLs:
  - "/metrics"
  verbs:
  - get
- apiGroups: ["extensions"]
  resources:
  - replicasets
  verbs:
  - "get"
  - "list"
  - "watch"
- apiGroups:
  - "apps"
  resources:
  - statefulsets
  - deployments
  - replicasets
  verbs:
  - "get"
  - "list"
  - "watch"
- apiGroups:
  - ""
  resources:
  - nodes/stats
  verbs:
  - get
- apiGroups:
  - "batch"
  resources:
  - jobs
  verbs:
  - "get"
  - "list"
  - "watch"
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: elastic-agent
  namespace: vulcan
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: elastic-agent
subjects:
- kind: ServiceAccount
  name: elastic-agent
  namespace: vulcan
roleRef:
  kind: ClusterRole
  name: elastic-agent
  apiGroup: rbac.authorization.k8s.io

Elastic Agent & Fleet Server

apiVersion: agent.k8s.elastic.co/v1alpha1
kind: Agent
metadata:
  name: fleet-server
  namespace: vulcan
spec:
  version: 7.14.0
  kibanaRef:
    name: kibana
  elasticsearchRefs:
  - name: elasticsearch
  mode: fleet
  fleetServerEnabled: true
  deployment:
    replicas: 1
    podTemplate:
      spec:
        serviceAccountName: fleet-server
        automountServiceAccountToken: true
        securityContext:
          runAsUser: 0
---
apiVersion: agent.k8s.elastic.co/v1alpha1
kind: Agent
metadata:
  name: elastic-agent
  namespace: vulcan
spec:
  version: 7.14.0
  kibanaRef:
    name: kibana
  fleetServerRef:
    name: fleet-server
  mode: fleet
  daemonSet:
    podTemplate:
      spec:
        serviceAccountName: elastic-agent
        automountServiceAccountToken: true
        securityContext:
          runAsUser: 0

I checked the final YAML that is generated after the Fleet Server is deployed, and no volume is mounted at the path that ca.crt is being copied from.

keiransteele · August 18, 2021, 11:51pm

I have encountered this issue too when deploying Fleet Server on Kubernetes with 7.14.0.

Name:         fleet-server-agent-7bfb544857-9nqfs
Namespace:    elastic-system
Priority:     0
Node:         
Start Time:   Wed, 18 Aug 2021 22:34:09 +0000
Labels:       agent.k8s.elastic.co/config-checksum=b9f6e9c55e56fbf03ab67b2f10155bae1f1a5e888b390deae5572e80
              agent.k8s.elastic.co/name=fleet-server
              agent.k8s.elastic.co/version=7.14.0
              common.k8s.elastic.co/type=agent
              pod-template-hash=7bfb544857
Annotations:  <none>
Status:       Running
IP:           10.42.0.19
IPs:
  IP:           10.42.0.19
Controlled By:  ReplicaSet/fleet-server-agent-7bfb544857
Containers:
  agent:
    Container ID:  containerd://09f3ccbd00e619186977a7b32f1337d31a778f975e550bac1dc158b27d966148
    Image:         docker.elastic.co/beats/elastic-agent:7.14.0
    Image ID:      docker.elastic.co/beats/elastic-agent@sha256:d479c991c9a32bc53976d88215103be4a0e1c4a48826ef2d954ad04f315c79bc
    Port:          8220/TCP
    Host Port:     0/TCP
    Command:
      /usr/bin/env
      bash
      -c
      #!/usr/bin/env bash
      set -e
      cp /mnt/elastic-internal/elasticsearch-association/elastic-system/elasticsearch/certs/ca.crt /etc/pki/ca-trust/source/anchors/
      update-ca-trust
      /usr/bin/tini -- /usr/local/bin/docker-entrypoint -e

    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Wed, 18 Aug 2021 23:46:31 +0000
      Finished:     Wed, 18 Aug 2021 23:46:31 +0000
    Ready:          False
    Restart Count:  19
    Limits:
      cpu:     200m
      memory:  1Gi
    Requests:
      cpu:     200m
      memory:  1Gi
    Environment:
      CONFIG_PATH:  /usr/share/elastic-agent
      NODE_NAME:     (v1:spec.nodeName)
    Mounts:
      /etc/agent.yml from config (ro,path="agent.yml")
      /usr/share/elastic-agent/fleet-setup.yml from fleet-setup-config (ro,path="fleet-setup.yml")
      /usr/share/fleet-server/config/http-certs from fleet-certs (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ggsr5 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  config:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  fleet-server-agent-config
    Optional:    false
  fleet-certs:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  fleet-server-agent-http-certs-internal
    Optional:    false
  fleet-setup-config:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  fleet-server-agent-config
    Optional:    false
  kube-api-access-ggsr5:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Guaranteed
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason   Age                    From     Message
  ----     ------   ----                   ----     -------
  Warning  BackOff  3m15s (x323 over 73m)  kubelet  Back-off restarting failed container

keiransteele · August 19, 2021, 9:17am

I compared the fleet server deployment with an agent deployment without fleet and came up with what I believe is the correct config for the deployment:

    Mounts:
      /etc/agent.yml from config (ro,path="agent.yml")
      /mnt/elastic-internal/elasticsearch-association/elastic-system/elasticsearch/certs from elasticsearch-certs-0 (ro)
      /usr/share/elastic-agent/fleet-setup.yml from fleet-setup-config (ro,path="fleet-setup.yml")
      /usr/share/fleet-server/config/http-certs from fleet-certs (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-9g5dw (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  config:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  fleet-server-agent-config
    Optional:    false
  fleet-certs:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  fleet-server-agent-http-certs-internal
    Optional:    false
  fleet-setup-config:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  fleet-server-agent-config
    Optional:    false
  elasticsearch-certs-0:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  fleet-server-agent-es-elastic-system-elasticsearch-ca
    Optional:    false

But then I get a new error. Unlike the error above, nothing in the container command obviously references this path, and no corresponding secret has been created for it:

Error: 1 error: open /mnt/elastic-internal/kibana-association/elastic-system/kibana/certs/ca.crt: no such file or directory reading <nil>

vaibhavsw · August 19, 2021, 11:14am

Is there any solution for this? Was anyone able to resolve it?

r0zbot · August 29, 2021, 2:54pm

I managed to get it to start by applying the following patches to the Deployment and DaemonSet that were created.

So for fleet-agent:

spec:
  template:
    spec:
      containers:
      - name: agent
        volumeMounts:
        - mountPath: /mnt/elastic-internal/kibana-association/elastic-system/siem/certs
          name: kibana-certificates-temp
          readOnly: true
        - mountPath: /mnt/elastic-internal/fleetserver-association/elastic-system/siem-fleet-server/certs
          name: fleetserver-certificates-temp
          readOnly: true
      volumes:
      - name: kibana-certificates-temp
        secret:
          defaultMode: 420
          optional: false
          secretName: siem-kb-es-ca
      - name: fleetserver-certificates-temp
        secret:
          defaultMode: 420
          optional: false
          secretName: siem-fleet-server-agent-http-certs-internal

And this one to fleet-server:

spec:
  template:
    spec:
      containers:
      - name: agent
        volumeMounts:
        - mountPath: /mnt/elastic-internal/elasticsearch-association/elastic-system/siem/certs
          name: fleet-certificates-temp
          readOnly: true
        - mountPath: /mnt/elastic-internal/kibana-association/elastic-system/siem/certs
          name: kibana-certificates-temp
          readOnly: true
      volumes:
      - name: fleet-certificates-temp
        secret:
          defaultMode: 420
          optional: false
          secretName: siem-fleet-server-agent-es-elastic-system-siem-ca
      - name: kibana-certificates-temp
        secret:
          defaultMode: 420
          optional: false
          secretName: siem-kb-es-ca

This indeed looks like an issue with elastic-operator or ECK v1.7.1 in general.
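
For anyone trying this: manual patches to objects managed by the operator may be overwritten when it reconciles, so the same mounts could instead be declared in the Agent resource's podTemplate. The fragment below is only a sketch of the fleet-server patch above moved into the Agent spec. The resource name `siem-fleet-server` is inferred from the secret names and is an assumption, as is the rest of the spec being unchanged. It has not been verified against ECK 1.7.1.

```yaml
# Sketch only: the fleet-server patch expressed through the Agent resource.
apiVersion: agent.k8s.elastic.co/v1alpha1
kind: Agent
metadata:
  name: siem-fleet-server        # assumed name, inferred from the secret names
  namespace: elastic-system
spec:
  # version, kibanaRef, elasticsearchRefs, mode and fleetServerEnabled
  # stay as in the existing manifest and are omitted here.
  deployment:
    podTemplate:
      spec:
        containers:
        - name: agent            # must match the container name ECK generates
          volumeMounts:
          - mountPath: /mnt/elastic-internal/elasticsearch-association/elastic-system/siem/certs
            name: fleet-certificates-temp
            readOnly: true
          - mountPath: /mnt/elastic-internal/kibana-association/elastic-system/siem/certs
            name: kibana-certificates-temp
            readOnly: true
        volumes:
        - name: fleet-certificates-temp
          secret:
            secretName: siem-fleet-server-agent-es-elastic-system-siem-ca
        - name: kibana-certificates-temp
          secret:
            secretName: siem-kb-es-ca
```

The same idea would apply to the elastic-agent resource under `daemonSet.podTemplate`.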

pebrc · August 30, 2021, 11:54am

@vaibhavsw Could you share your Elasticsearch/Kibana manifests as well? Were you using custom HTTP certificates for those by any chance?

keiransteele · August 30, 2021, 8:10pm

@r0zbot Are you getting metrics and logs from the agent? I made the same changes you did and was able to get an agent to register with the Fleet Server. Both show up in Kibana, but no data is coming through.

r0zbot · August 31, 2021, 9:30pm

No, we haven't been able to get that working yet. There may be some additional mounts needed.

BenB196 · August 31, 2021, 11:17pm

I wanted to drop a reply here as I came across a similar issue, but was able to resolve it.

Something that is not very clear up front in the ECK docs is that if you are using a custom HTTP cert for Kibana and Elasticsearch (and Fleet Server), the secret that contains the tls.crt and tls.key values should also contain the root ca.crt and any intermediate CA certs. The mentions of this are 1. here in the actual command, and 2. here.

The command in mention 1 only accounts for self-signed certs, but if you have a cert signed by a private CA, you can replace --from-file=ca.crt=tls.crt with --from-file=ca.crt=<your_ca_crt>.

If the root ca.crt and any intermediate CA certs are not in the secret, they can't be mounted into the Agent deployment for the certs to be validated.
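
As a rough illustration of the point above, a custom HTTP certificate secret that includes the CA chain might look like the following. This is a sketch, not a tested manifest: the secret name `kibana-custom-http-cert` and the `elastic-system` namespace are hypothetical, and the PEM placeholders need to be replaced with real content.

```yaml
# Sketch only: secret with the server cert, its key, and the CA chain.
apiVersion: v1
kind: Secret
metadata:
  name: kibana-custom-http-cert   # hypothetical name
  namespace: elastic-system       # assumed namespace
type: Opaque
stringData:
  ca.crt: |
    <PEM of the root CA, followed by any intermediate CA certs>
  tls.crt: |
    <PEM of the server certificate>
  tls.key: |
    <PEM of the private key>
---
# Fragment of the Kibana resource that points at the secret above;
# all other fields of the existing resource stay unchanged.
apiVersion: kibana.k8s.elastic.co/v1
kind: Kibana
metadata:
  name: kibana
  namespace: elastic-system
spec:
  http:
    tls:
      certificate:
        secretName: kibana-custom-http-cert
```

The same secret layout would apply to the Elasticsearch resource, which takes the reference under the same `http.tls.certificate.secretName` field.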

Also, if you use something like Let's Encrypt to generate your certs, then you're probably running into this issue.

The reason you probably aren't getting any logs is that the Agent is failing to validate the cert of the Elasticsearch HTTPS endpoint. If you mount the Elasticsearch CA the same way you did the Fleet and Kibana ones, it should work. Alternatively, you can make the change above and add the CAs to the secrets.
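
A minimal sketch of the first option, extending the fleet-agent patch posted earlier in the thread with an Elasticsearch CA mount. The names are assumptions: the mount path follows the pattern of the other association paths, and `siem-es-http-certs-public` assumes an Elasticsearch cluster named `siem` and ECK's usual naming for the public HTTP certs secret. Whether the agent reads the CA from this path has not been confirmed in this thread.

```yaml
# Sketch only: additional mount for the elastic-agent DaemonSet patch.
spec:
  template:
    spec:
      containers:
      - name: agent
        volumeMounts:
        - mountPath: /mnt/elastic-internal/elasticsearch-association/elastic-system/siem/certs
          name: elasticsearch-certificates-temp   # hypothetical volume name
          readOnly: true
      volumes:
      - name: elasticsearch-certificates-temp
        secret:
          defaultMode: 420
          optional: false
          secretName: siem-es-http-certs-public   # assumed secret name
```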

dkow · September 2, 2021, 10:27am

Thanks everyone for your input.

There is indeed a bug in the ECK Agent controller that leads to this behavior. Please see the GitHub issue for the bug description and a workaround. The fix is planned for release in 1.8.0.

Thanks,
David

