ECK Fleet startup failure on ECK on Cloud

I followed the latest documentation to start a Fleet Server on Kubernetes, but both Elastic Agent and Fleet Server are throwing the error below.

cp: cannot stat '/mnt/elastic-internal/elasticsearch-association/vulcan/elasticsearch/certs/ca.crt': No such file or directory

Role & ClusterRoleBinding

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: fleet-server
rules:
- apiGroups: [""]
  resources:
  - pods
  verbs:
  - get
  - watch
  - list
- apiGroups: ["coordination.k8s.io"]
  resources:
  - leases
  verbs:
  - get
  - create
  - update
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: fleet-server
  namespace: vulcan
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: fleet-server
subjects:
- kind: ServiceAccount
  name: fleet-server
  namespace: vulcan
roleRef:
  kind: ClusterRole
  name: fleet-server
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: elastic-agent
rules:
- apiGroups: [""]
  resources:
  - pods
  - nodes
  - namespaces
  - events
  - services
  - configmaps
  verbs:
  - get
  - watch
  - list
- apiGroups: ["coordination.k8s.io"]
  resources:
  - leases
  verbs:
  - get
  - create
  - update
- nonResourceURLs:
  - "/metrics"
  verbs:
  - get
- apiGroups: ["extensions"]
  resources:
  - replicasets
  verbs:
  - "get"
  - "list"
  - "watch"
- apiGroups:
  - "apps"
  resources:
  - statefulsets
  - deployments
  - replicasets
  verbs:
  - "get"
  - "list"
  - "watch"
- apiGroups:
  - ""
  resources:
  - nodes/stats
  verbs:
  - get
- apiGroups:
  - "batch"
  resources:
  - jobs
  verbs:
  - "get"
  - "list"
  - "watch"
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: elastic-agent
  namespace: vulcan
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: elastic-agent
subjects:
- kind: ServiceAccount
  name: elastic-agent
  namespace: vulcan
roleRef:
  kind: ClusterRole
  name: elastic-agent
  apiGroup: rbac.authorization.k8s.io

Elastic Agent & Fleet Server

apiVersion: agent.k8s.elastic.co/v1alpha1
kind: Agent
metadata:
  name: fleet-server
  namespace: vulcan
spec:
  version: 7.14.0
  kibanaRef:
    name: kibana
  elasticsearchRefs:
  - name: elasticsearch
  mode: fleet
  fleetServerEnabled: true
  deployment:
    replicas: 1
    podTemplate:
      spec:
        serviceAccountName: fleet-server
        automountServiceAccountToken: true
        securityContext:
          runAsUser: 0
---
apiVersion: agent.k8s.elastic.co/v1alpha1
kind: Agent
metadata:
  name: elastic-agent
  namespace: vulcan
spec:
  version: 7.14.0
  kibanaRef:
    name: kibana
  fleetServerRef:
    name: fleet-server
  mode: fleet
  daemonSet:
    podTemplate:
      spec:
        serviceAccountName: elastic-agent
        automountServiceAccountToken: true
        securityContext:
          runAsUser: 0

I checked the final YAML that is generated after the Fleet Server is deployed, and no volume is mounted at the path that ca.crt is being copied from.

I have encountered this issue too when deploying Fleet Server on Kubernetes with 7.14.0.

Name:         fleet-server-agent-7bfb544857-9nqfs
Namespace:    elastic-system
Priority:     0
Node:         
Start Time:   Wed, 18 Aug 2021 22:34:09 +0000
Labels:       agent.k8s.elastic.co/config-checksum=b9f6e9c55e56fbf03ab67b2f10155bae1f1a5e888b390deae5572e80
              agent.k8s.elastic.co/name=fleet-server
              agent.k8s.elastic.co/version=7.14.0
              common.k8s.elastic.co/type=agent
              pod-template-hash=7bfb544857
Annotations:  <none>
Status:       Running
IP:           10.42.0.19
IPs:
  IP:           10.42.0.19
Controlled By:  ReplicaSet/fleet-server-agent-7bfb544857
Containers:
  agent:
    Container ID:  containerd://09f3ccbd00e619186977a7b32f1337d31a778f975e550bac1dc158b27d966148
    Image:         docker.elastic.co/beats/elastic-agent:7.14.0
    Image ID:      docker.elastic.co/beats/elastic-agent@sha256:d479c991c9a32bc53976d88215103be4a0e1c4a48826ef2d954ad04f315c79bc
    Port:          8220/TCP
    Host Port:     0/TCP
    Command:
      /usr/bin/env
      bash
      -c
      #!/usr/bin/env bash
      set -e
      cp /mnt/elastic-internal/elasticsearch-association/elastic-system/elasticsearch/certs/ca.crt /etc/pki/ca-trust/source/anchors/
      update-ca-trust
      /usr/bin/tini -- /usr/local/bin/docker-entrypoint -e

    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Wed, 18 Aug 2021 23:46:31 +0000
      Finished:     Wed, 18 Aug 2021 23:46:31 +0000
    Ready:          False
    Restart Count:  19
    Limits:
      cpu:     200m
      memory:  1Gi
    Requests:
      cpu:     200m
      memory:  1Gi
    Environment:
      CONFIG_PATH:  /usr/share/elastic-agent
      NODE_NAME:     (v1:spec.nodeName)
    Mounts:
      /etc/agent.yml from config (ro,path="agent.yml")
      /usr/share/elastic-agent/fleet-setup.yml from fleet-setup-config (ro,path="fleet-setup.yml")
      /usr/share/fleet-server/config/http-certs from fleet-certs (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ggsr5 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  config:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  fleet-server-agent-config
    Optional:    false
  fleet-certs:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  fleet-server-agent-http-certs-internal
    Optional:    false
  fleet-setup-config:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  fleet-server-agent-config
    Optional:    false
  kube-api-access-ggsr5:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Guaranteed
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason   Age                    From     Message
  ----     ------   ----                   ----     -------
  Warning  BackOff  3m15s (x323 over 73m)  kubelet  Back-off restarting failed container

I compared the fleet server deployment with an agent deployment without fleet and came up with what I believe is the correct config for the deployment:

    Mounts:
      /etc/agent.yml from config (ro,path="agent.yml")
      /mnt/elastic-internal/elasticsearch-association/elastic-system/elasticsearch/certs from elasticsearch-certs-0 (ro)
      /usr/share/elastic-agent/fleet-setup.yml from fleet-setup-config (ro,path="fleet-setup.yml")
      /usr/share/fleet-server/config/http-certs from fleet-certs (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-9g5dw (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  config:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  fleet-server-agent-config
    Optional:    false
  fleet-certs:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  fleet-server-agent-http-certs-internal
    Optional:    false
  fleet-setup-config:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  fleet-server-agent-config
    Optional:    false
  elasticsearch-certs-0:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  fleet-server-agent-es-elastic-system-elasticsearch-ca
    Optional:    false

With that in place, however, I get a new error. Unlike the error above, nothing apparent in the pod spec references this path, and no corresponding secret seems to be created for it:

Error: 1 error: open /mnt/elastic-internal/kibana-association/elastic-system/kibana/certs/ca.crt: no such file or directory reading <nil>

Is there any solution for this? Was anyone able to resolve it?

I managed to get it to start by applying the following patch to the deployment and daemonset created.

So for fleet-agent:

spec:
  template:
    spec:
      containers:
      - name: agent
        volumeMounts:
        - mountPath: /mnt/elastic-internal/kibana-association/elastic-system/siem/certs
          name: kibana-certificates-temp
          readOnly: true
        - mountPath: /mnt/elastic-internal/fleetserver-association/elastic-system/siem-fleet-server/certs
          name: fleetserver-certificates-temp
          readOnly: true
      volumes:
      - name: kibana-certificates-temp
        secret:
          defaultMode: 420
          optional: false
          secretName: siem-kb-es-ca
      - name: fleetserver-certificates-temp
        secret:
          defaultMode: 420
          optional: false
          secretName: siem-fleet-server-agent-http-certs-internal

And this one to fleet-server:

spec:
  template:
    spec:
      containers:
      - name: agent
        volumeMounts:
        - mountPath: /mnt/elastic-internal/elasticsearch-association/elastic-system/siem/certs
          name: fleet-certificates-temp
          readOnly: true
        - mountPath: /mnt/elastic-internal/kibana-association/elastic-system/siem/certs
          name: kibana-certificates-temp
          readOnly: true
      volumes:
      - name: fleet-certificates-temp
        secret:
          defaultMode: 420
          optional: false
          secretName: siem-fleet-server-agent-es-elastic-system-siem-ca
      - name: kibana-certificates-temp
        secret:
          defaultMode: 420
          optional: false
          secretName: siem-kb-es-ca

This indeed looks like an issue with elastic-operator or ECK v1.7.1 in general.

@vaibhavsw Could you share your Elasticsearch/Kibana manifests as well? Were you using custom HTTP certificates for those by any chance?

@r0zbot are you getting metrics and logs from the agent? I made the same changes you did and was able to get an agent to register with the Fleet Server, and both show up in Kibana, but there is no data coming through.

No, we haven't been able to get that working yet. There may be some additional mounts needed.

I wanted to drop a reply here as I came across a similar issue, but was able to resolve it.

Something the ECK docs do not make very clear up front is that if you are using a custom HTTP certificate for Kibana and Elasticsearch (and Fleet Server), the secret that contains the tls.crt and tls.key values should also contain the root ca.crt and any intermediate CA certificates. This is mentioned in two places: 1. here, in the actual command, and 2. here.

The command in mention 1 only covers self-signed certificates. If your certificate is signed by a private CA, you can replace --from-file=ca.crt=tls.crt with --from-file=ca.crt=<your_ca_crt>

If the secret does not contain the root ca.crt and any intermediate CA certificates, they can't be mounted into the Agent deployment, so the certificates can't be validated.
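To make that concrete, here is a rough sketch of what such a secret and its reference from Kibana could look like. The secret name kibana-custom-http-cert, the elastic-system namespace, and the placeholder values are assumptions for illustration only; spec.http.tls.certificate.secretName is the field ECK uses to point a resource at a custom HTTP certificate.

```yaml
# Hypothetical secret holding a custom HTTP certificate.
# The name and the placeholder values are illustrative assumptions.
apiVersion: v1
kind: Secret
metadata:
  name: kibana-custom-http-cert
  namespace: elastic-system
type: Opaque
data:
  tls.crt: <base64-encoded server certificate>
  tls.key: <base64-encoded private key>
  # Root CA plus any intermediates; this is the key that is easy to leave out
  ca.crt: <base64-encoded CA certificate chain>
---
apiVersion: kibana.k8s.elastic.co/v1
kind: Kibana
metadata:
  name: kibana
  namespace: elastic-system
spec:
  version: 7.14.0
  # count, elasticsearchRef and other fields omitted for brevity
  http:
    tls:
      certificate:
        secretName: kibana-custom-http-cert
```

The same three keys would be expected in the secrets used for Elasticsearch and Fleet Server if they also use custom certificates.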

Also, if you use something like Let's Encrypt to generate your certificates, you are probably running into this issue.

The reason you probably aren't getting any logs is that the Agent is failing to validate the certificate of the Elasticsearch HTTPS endpoint. If you mount the Elasticsearch CA, as you did the Fleet and Kibana ones, it should work. Alternatively, you can make the change above and add the CAs to the secrets.
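A minimal sketch of that extra mount, written in the same patch style as the ones posted earlier in this thread. The mount path, the volume name, and the secret name siem-es-http-certs-public are all assumptions modelled on the naming seen above, so check which secret in your namespace actually holds the Elasticsearch ca.crt before using it. Whether the agent picks the file up from this path also depends on how the pod's startup command and output configuration reference it, which this thread does not confirm.

```yaml
spec:
  template:
    spec:
      containers:
      - name: agent
        volumeMounts:
        # Assumed path, modelled on the elasticsearch-association paths above
        - mountPath: /mnt/elastic-internal/elasticsearch-association/elastic-system/siem/certs
          name: elasticsearch-certificates-temp   # hypothetical volume name
          readOnly: true
      volumes:
      - name: elasticsearch-certificates-temp
        secret:
          defaultMode: 420
          optional: false
          # Assumed: a secret in the pod's namespace that contains the Elasticsearch ca.crt
          secretName: siem-es-http-certs-public
```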

Thanks everyone for your input.

There is indeed a bug in the ECK Agent controller that leads to this behavior. Please see the GitHub issue for the bug description and a workaround. The fix is planned for release in 1.8.0.

Thanks,
David