Elastic Cloud on Kubernetes - Persist Fleet Server State

Hello all

Maybe someone can help me, because I am at my wits' end and I have not found anything in the documentation that covers this case.

I am currently setting up an Elastic Stack on on-prem Kubernetes, and my issue is that I cannot persist the state of the Fleet Server. That is, if I delete my Fleet Server pod (to simulate a version update), a new Fleet Server gets added in Kibana and the "old" Fleet Server goes offline.

In my Docker Compose setup I could mount a volume at /usr/share/elastic-agent/state to keep the state, but the same approach is not working in the Kubernetes setup.
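For reference, this is roughly what worked under Compose; a minimal sketch from memory, with placeholder service and volume names rather than my exact file:

services:
  fleet-server:
    image: docker.elastic.co/beats/elastic-agent:8.18.0
    volumes:
      # named volume so the agent keeps its enrollment state across restarts
      - fleet-server-state:/usr/share/elastic-agent/state

volumes:
  fleet-server-state: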

I used the statefulSet deployment method and overrode the agent-data volume to use a persistent volume claim (partial values.yml for the Helm chart below):

eck-fleet-server:
  enabled: true
  fullnameOverride: "fleet-server"
  version: 8.18.0
  serviceAccountName: fleet-server
  statefulSet:
    replicas: 1
    podTemplate:
      spec:
        volumes:
          - name: agent-data
            persistentVolumeClaim:
              claimName: agent-data
    volumeClaimTemplates:
      - metadata:
          name: agent-data
        spec:
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 500Mi
          storageClassName: ceph-block
  policyID: elk-fleet-server
  kibanaRef:
    name: kibana
  elasticsearchRefs:
  - name: elasticsearch

As far as I can tell, this setup results in a correctly created pod (abbreviated kubectl describe below):

...
Containers:
  agent:
    Container ID:   containerd://...
    Image:          docker.elastic.co/beats/elastic-agent:8.18.0
    Image ID:       docker.elastic.co/beats/elastic-agent@sha256:c26375e5870b1efa211b5820645346784dc7a08e687314fe739aba294647ab5f
    Port:           8220/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Wed, 07 May 2025 10:53:46 +0200
    Ready:          True
...
    Mounts:
      /etc/agent.yml from config (ro,path="agent.yml")
      /mnt/elastic-internal/elasticsearch-association/elk-test/elasticsearch/certs from elasticsearch-certs (ro)
      /usr/share/elastic-agent/state from agent-data (rw)
      /usr/share/fleet-server/config/http-certs from fleet-certs (ro)
Volumes:
  agent-data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  agent-data-fleet-server-agent-0
    ReadOnly:   false
...
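Note that the ClaimName above is agent-data-fleet-server-agent-0, i.e. the PVC generated from the volumeClaimTemplate, not the standalone agent-data claim referenced in the pod-level volumes override. If I understand StatefulSet semantics correctly (a volumeClaimTemplate whose name matches the container's volume mount is bound automatically), the pod-level override is redundant and the values could probably be trimmed to just the template; an untested sketch:

eck-fleet-server:
  statefulSet:
    replicas: 1
    volumeClaimTemplates:
      - metadata:
          # must match the default agent-data volume mounted at /usr/share/elastic-agent/state
          name: agent-data
        spec:
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 500Mi
          storageClassName: ceph-block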

I have also tried the setup with:

  • Creating a new volume and mounting it at /usr/share/elastic-agent/state -> no change
  • Storing all of /usr/share/elastic-agent in a volume -> the container failed to boot

Am I doing something wrong, have I overlooked something in the documentation, or is this even possible to achieve?

Thanks in advance

I'm having the same problem on Docker with Compose. It worked on 8.17.4 and 8.17.6; when I upgraded to 8.18.1 this problem started happening. Every time I restart the Compose stack, the Fleet Server gets a new "id". Have you tried 8.18.x on Compose, or did you use an older version? Could it be something introduced in the newest version?

I do not use ECK, but on Elastic Cloud this is how it works: for example, when you upgrade or change anything on the Integrations Server, it will spin up a new Fleet Server.

Maybe this is the default behavior? The default policy also has a short inactivity timeout of just 24 hours, so offline agents will not be shown for long.

This seems unrelated, as this topic is about running on ECK, not Docker Compose.

I suggest that you open another topic and share your docker-compose file and the issue you are having.

You are right, it is probably working as intended. I have checked the agent policy UI in Kibana, and the description of the inactive agent unenrollment timeout specifically states:

... This can be useful for policies containing ephemeral agents, such as those in a Docker or Kubernetes environment.

If anyone stumbles across this post: I have updated my Fleet agent policy with inactivity_timeout and unenroll_timeout to keep my agent list clean.

xpack.fleet.agentPolicies:
    - name: Fleet Server policy
      id: elk-fleet-server
      namespace: default
      is_managed: true
      monitoring_enabled:
      - logs
      - metrics
      inactivity_timeout: 900 # remove from UI after 15min
      unenroll_timeout: 86400 # unenroll agent after 24h
      package_policies:
      - name: fleet_server-1
        id: fleet_server-1
        package:
          name: fleet_server
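
For completeness, this block goes into Kibana's configuration, not the Fleet Server's. A sketch of where it sits in my Helm values, assuming the eck-kibana chart (the fullnameOverride matches the kibanaRef above; adjust to your setup):

eck-kibana:
  enabled: true
  fullnameOverride: "kibana"
  config:
    xpack.fleet.agentPolicies:
      # ... the policy exactly as above ...

Since the policy is preconfigured with is_managed: true, further changes have to be made here rather than through the Fleet UI.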