[Solved] Elastic Agent running as Fleet Server shuts down automatically for no apparent reason in K8s

Context

I'm trying to run a Fleet Server using Elastic Agent in Kubernetes.
Docker image: docker.elastic.co/beats/elastic-agent:7.13.2
Elasticsearch version: 7.13.2
Kibana version: 7.13.2

What happened

When the Fleet Server starts, it registers itself in Kibana automatically, but several seconds later it starts shutting down.

The agent can connect to both Kibana and Elasticsearch.

Configuration

I didn't change the start command. I only added several environment variables:

extraEnvs:
  - name: FLEET_SERVER_ENABLE
    value: "1"
  - name: KIBANA_FLEET_SETUP
    value: "1"
  - name: FLEET_ENROLL
    value: "1"
  - name: FLEET_INSECURE
    value: "1"
  - name: FLEET_SERVER_PORT
    value: "8220"
  - name: KIBANA_HOST
    value: "http://kibana:5601"
  - name: KIBANA_USERNAME
    valueFrom:
      secretKeyRef:
        name: elastic-credentials
        key: username
  - name: KIBANA_PASSWORD
    valueFrom:
      secretKeyRef:
        name: elastic-credentials
        key: password
  - name: ELASTICSEARCH_HOST
    value: "http://elasticsearch:9200"
  - name: ELASTICSEARCH_USERNAME
    valueFrom:
      secretKeyRef:
        name: elastic-credentials
        key: username
  - name: ELASTICSEARCH_PASSWORD
    valueFrom:
      secretKeyRef:
        name: elastic-credentials
        key: password

Log

Performing setup of Fleet in Kibana

Policy selected for enrollment:  
The Elastic Agent is currently in BETA and should not be used in production

2021-07-04T07:57:07.289Z	INFO	cmd/enroll_cmd.go:300	Generating self-signed certificate for Fleet Server
2021-07-04T07:57:08.515Z	INFO	cmd/enroll_cmd.go:468	Spawning Elastic Agent daemon as a subprocess to complete bootstrap process.
2021-07-04T07:57:08.733Z	INFO	warn/warn.go:18	The Elastic Agent is currently in BETA and should not be used in production
2021-07-04T07:57:08.734Z	INFO	application/application.go:68	Detecting execution mode
2021-07-04T07:57:08.735Z	INFO	application/application.go:89	Agent is in Fleet Server bootstrap mode
2021-07-04T07:57:09.341Z	INFO	[api]	api/server.go:62	Starting stats endpoint
2021-07-04T07:57:09.341Z	INFO	application/fleet_server_bootstrap.go:124	Agent is starting
2021-07-04T07:57:09.341Z	INFO	[api]	api/server.go:64	Metrics endpoint listening on: /usr/share/elastic-agent/state/data/tmp/elastic-agent.sock (configured: unix:///usr/share/elastic-agent/state/data/tmp/elastic-agent.sock)
2021-07-04T07:57:09.342Z	INFO	application/fleet_server_bootstrap.go:134	Agent is stopped
2021-07-04T07:57:09.345Z	INFO	stateresolver/stateresolver.go:48	New State ID is BkAFCwJp
2021-07-04T07:57:09.346Z	INFO	stateresolver/stateresolver.go:49	Converging state requires execution of 1 step(s)
2021-07-04T07:57:10.762Z	INFO	log/reporter.go:40	2021-07-04T07:57:10Z - message: Application: fleet-server--7.13.2[]: State changed to STARTING: Starting - type: 'STATE' - sub_type: 'STARTING'
2021-07-04T07:57:10.764Z	INFO	stateresolver/stateresolver.go:66	Updating internal state
2021-07-04T07:57:11.526Z	INFO	cmd/enroll_cmd.go:643	Fleet Server - Starting
2021-07-04T07:57:12.311Z	WARN	status/reporter.go:236	Elastic Agent status changed to: 'degraded'
2021-07-04T07:57:12.312Z	INFO	log/reporter.go:40	2021-07-04T07:57:12Z - message: Application: fleet-server--7.13.2[]: State changed to DEGRADED: Running on default policy with Fleet Server integration; missing config fleet.agent.id (expected during bootstrap process) - type: 'STATE' - sub_type: 'RUNNING'
2021-07-04T07:57:12.529Z	INFO	cmd/enroll_cmd.go:624	Fleet Server - Running on default policy with Fleet Server integration; missing config fleet.agent.id (expected during bootstrap process)
2021-07-04T07:57:13.080Z	WARN	[tls]	tlscommon/tls_config.go:98	SSL/TLS verifications disabled.
2021-07-04T07:57:16.906Z	INFO	cmd/enroll_cmd.go:206	Elastic Agent has been enrolled; start Elastic Agent
2021-07-04T07:57:16.906Z	INFO	cmd/run.go:189	Shutting down Elastic Agent and sending last events...
2021-07-04T07:57:16.907Z	INFO	operation/operator.go:191	waiting for installer of pipeline 'default' to finish
2021-07-04T07:57:16.907Z	INFO	process/app.go:176	Signaling application to stop because of shutdown: fleet-server--7.13.2
2021-07-04T07:57:17.407Z	INFO	status/reporter.go:236	Elastic Agent status changed to: 'online'
2021-07-04T07:57:17.408Z	INFO	log/reporter.go:40	2021-07-04T07:57:17Z - message: Application: fleet-server--7.13.2[]: State changed to STOPPED: Stopped - type: 'STATE' - sub_type: 'STOPPED'
2021-07-04T07:57:17.408Z	INFO	cmd/run.go:197	Shutting down completed.
2021-07-04T07:57:17.408Z	INFO	[api]	api/server.go:66	Stats endpoint (/usr/share/elastic-agent/state/data/tmp/elastic-agent.sock) finished: accept unix /usr/share/elastic-agent/state/data/tmp/elastic-agent.sock: use of closed network connection
Successfully enrolled the Elastic Agent.
2021-07-04T07:57:17.526Z	INFO	warn/warn.go:18	The Elastic Agent is currently in BETA and should not be used in production
2021-07-04T07:57:17.526Z	INFO	application/application.go:68	Detecting execution mode
2021-07-04T07:57:17.527Z	INFO	application/application.go:93	Agent is managed by Fleet
2021-07-04T07:57:17.527Z	INFO	capabilities/capabilities.go:59	capabilities file not found in /usr/share/elastic-agent/state/capabilities.yml
2021-07-04T07:57:18.007Z	INFO	[composable]	composable/controller.go:46	EXPERIMENTAL - Inputs with variables are currently experimental and should not be used in production
I0704 07:57:18.216288       6 leaderelection.go:243] attempting to acquire leader lease  elastic-stack/elastic-agent-cluster-leader...
2021-07-04T07:57:18.219Z	INFO	[composable.providers.docker]	docker/docker.go:43	Docker provider skipped, unable to connect: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
2021-07-04T07:57:18.221Z	INFO	[composable.providers.kubernetes]	kubernetes/kubernetes.go:64	Kubernetes provider started with node scope
2021-07-04T07:57:18.221Z	INFO	[composable.providers.kubernetes]	kubernetes/util.go:114	kubernetes: Using pod name fleet-server-0 and namespace elastic-stack to discover kubernetes node
2021-07-04T07:57:18.242Z	INFO	[composable.providers.kubernetes]	kubernetes/util.go:120	kubernetes: Using node k8s-production.hkr.org discovered by in cluster pod node query
2021-07-04T07:57:18.343Z	INFO	[api]	api/server.go:62	Starting stats endpoint
2021-07-04T07:57:18.343Z	INFO	application/managed_mode.go:291	Agent is starting
2021-07-04T07:57:18.344Z	INFO	[api]	api/server.go:64	Metrics endpoint listening on: /usr/share/elastic-agent/state/data/tmp/elastic-agent.sock (configured: unix:///usr/share/elastic-agent/state/data/tmp/elastic-agent.sock)
2021-07-04T07:57:18.446Z	WARN	application/managed_mode.go:304	failed to ack update open /usr/share/elastic-agent/state/data/.update-marker: no such file or directory
2021-07-04T07:57:18.453Z	INFO	stateresolver/stateresolver.go:48	New State ID is 7DXuPtyK
2021-07-04T07:57:18.453Z	INFO	stateresolver/stateresolver.go:49	Converging state requires execution of 2 step(s)
2021-07-04T07:57:18.538Z	INFO	operation/operator.go:259	operation 'operation-install' skipped for fleet-server.7.13.2
2021-07-04T07:57:18.821Z	INFO	log/reporter.go:40	2021-07-04T07:57:18Z - message: Application: fleet-server--7.13.2[a303e8dc-b23b-4558-b78c-13f9b86df96a]: State changed to STARTING: Starting - type: 'STATE' - sub_type: 'STARTING'
2021-07-04T07:57:19.868Z	INFO	log/reporter.go:40	2021-07-04T07:57:19Z - message: Application: fleet-server--7.13.2[a303e8dc-b23b-4558-b78c-13f9b86df96a]: State changed to RUNNING: Running on default policy with Fleet Server integration - type: 'STATE' - sub_type: 'RUNNING'
2021-07-04T07:57:28.911Z	INFO	log/reporter.go:40	2021-07-04T07:57:28Z - message: Application: filebeat--7.13.2--36643631373035623733363936343635[a303e8dc-b23b-4558-b78c-13f9b86df96a]: State changed to STARTING: Starting - type: 'STATE' - sub_type: 'STARTING'
2021-07-04T07:57:30.171Z	INFO	log/reporter.go:40	2021-07-04T07:57:30Z - message: Application: filebeat--7.13.2--36643631373035623733363936343635[a303e8dc-b23b-4558-b78c-13f9b86df96a]: State changed to RUNNING: Running - type: 'STATE' - sub_type: 'RUNNING'
2021-07-04T07:57:35.027Z	INFO	log/reporter.go:40	2021-07-04T07:57:35Z - message: Application: metricbeat--7.13.2--36643631373035623733363936343635[a303e8dc-b23b-4558-b78c-13f9b86df96a]: State changed to STARTING: Starting - type: 'STATE' - sub_type: 'STARTING'
2021-07-04T07:57:35.040Z	INFO	stateresolver/stateresolver.go:66	Updating internal state
2021-07-04T07:57:35.046Z	INFO	stateresolver/stateresolver.go:48	New State ID is 7DXuPtyK
2021-07-04T07:57:35.047Z	INFO	stateresolver/stateresolver.go:49	Converging state requires execution of 0 step(s)
2021-07-04T07:57:35.047Z	INFO	stateresolver/stateresolver.go:66	Updating internal state
I0704 07:57:36.043246       6 leaderelection.go:253] successfully acquired lease elastic-stack/elastic-agent-cluster-leader
2021-07-04T07:57:36.203Z	INFO	stateresolver/stateresolver.go:48	New State ID is 7DXuPtyK
2021-07-04T07:57:36.204Z	INFO	stateresolver/stateresolver.go:49	Converging state requires execution of 0 step(s)
2021-07-04T07:57:36.204Z	INFO	stateresolver/stateresolver.go:66	Updating internal state
2021-07-04T07:57:36.391Z	INFO	log/reporter.go:40	2021-07-04T07:57:36Z - message: Application: metricbeat--7.13.2--36643631373035623733363936343635[a303e8dc-b23b-4558-b78c-13f9b86df96a]: State changed to RUNNING: Running - type: 'STATE' - sub_type: 'RUNNING'
2021-07-04T07:57:37.040Z	INFO	cmd/run.go:189	Shutting down Elastic Agent and sending last events...
2021-07-04T07:57:37.041Z	INFO	operation/operator.go:191	waiting for installer of pipeline 'default' to finish
2021-07-04T07:57:37.041Z	INFO	process/app.go:176	Signaling application to stop because of shutdown: metricbeat--7.13.2--36643631373035623733363936343635
2021-07-04T07:57:37.208Z	ERROR	fleet/fleet_gateway.go:167	context canceled
2021-07-04T07:57:37.208Z	ERROR	status/reporter.go:236	Elastic Agent status changed to: 'error'
2021-07-04T07:57:37.392Z	INFO	log/reporter.go:40	2021-07-04T07:57:37Z - message: Application: metricbeat--7.13.2--36643631373035623733363936343635[a303e8dc-b23b-4558-b78c-13f9b86df96a]: State changed to STOPPING: Stopping - type: 'STATE' - sub_type: 'STOPPING'
2021-07-04T07:57:37.542Z	INFO	log/reporter.go:40	2021-07-04T07:57:37Z - message: Application: fleet-server--7.13.2[a303e8dc-b23b-4558-b78c-13f9b86df96a]: State changed to STOPPED: Stopped - type: 'STATE' - sub_type: 'STOPPED'
2021-07-04T07:57:37.542Z	INFO	log/reporter.go:40	2021-07-04T07:57:37Z - message: Application: filebeat--7.13.2--36643631373035623733363936343635[a303e8dc-b23b-4558-b78c-13f9b86df96a]: State changed to STOPPED: Stopped - type: 'STATE' - sub_type: 'STOPPED'
2021-07-04T07:57:38.528Z	ERROR	fleet/fleet_gateway.go:167	context canceled
2021-07-04T07:57:38.542Z	INFO	process/app.go:176	Signaling application to stop because of shutdown: fleet-server--7.13.2
2021-07-04T07:57:38.542Z	INFO	process/app.go:176	Signaling application to stop because of shutdown: filebeat--7.13.2--36643631373035623733363936343635
2021-07-04T07:57:38.542Z	INFO	log/reporter.go:40	2021-07-04T07:57:38Z - message: Application: metricbeat--7.13.2--36643631373035623733363936343635[a303e8dc-b23b-4558-b78c-13f9b86df96a]: State changed to STOPPED: Stopped - type: 'STATE' - sub_type: 'STOPPED'
2021-07-04T07:57:38.542Z	INFO	application/managed_mode.go:320	Agent is stopped
2021-07-04T07:57:38.543Z	INFO	cmd/run.go:197	Shutting down completed.
2021-07-04T07:57:38.543Z	INFO	[api]	api/server.go:66	Stats endpoint (/usr/share/elastic-agent/state/data/tmp/elastic-agent.sock) finished: accept unix /usr/share/elastic-agent/state/data/tmp/elastic-agent.sock: use of closed network connection

Other info

When I use the Docker CLI to start the container for testing, it works normally.


I've been working on this for several days and now I feel really sleepy. If there is any info I didn't include here, please let me know, thanks!

Any help would be greatly appreciated! :slight_smile:

From my observation, I guess the agent goes through three steps when it registers with Kibana:

  1. Register with Kibana and generate the config
  2. Restart
  3. Start the agent with the new config

The problem with k8s may be:
When the agent restarts, k8s kills the container and removes the config, so the agent registers again and again...

Possible solution:
Add a volume to it?
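
A minimal sketch of what adding a volume could look like, assuming the agent runs as a StatefulSet (the pod name fleet-server-0 in the log suggests this) and keeps its enrollment state under /usr/share/elastic-agent/state, the path that appears in the log. The container name, the volume name agent-state, and the 1Gi size are placeholders, not values from my actual setup. This is only a fragment; with a Helm chart the same settings would go into the chart's volume options instead.

```yaml
# Sketch only: keep the agent state directory across restarts.
# Names and sizes are placeholders; the rest of the StatefulSet spec is omitted.
spec:
  template:
    spec:
      containers:
        - name: fleet-server
          image: docker.elastic.co/beats/elastic-agent:7.13.2
          volumeMounts:
            - name: agent-state
              # State path as it appears in the agent log
              mountPath: /usr/share/elastic-agent/state
  volumeClaimTemplates:
    - metadata:
        name: agent-state
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 1Gi
```

If the state only has to survive container restarts inside the same pod, an emptyDir volume mounted at the same path would probably be enough; a persistent volume claim also covers the pod being rescheduled.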

Confirmed. The problem was the missing volume, and the liveness probe needs to use HTTPS :slight_smile:
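
For the liveness probe part, here is a rough sketch. The Fleet Server generates a self-signed certificate during bootstrap (see the log), so a plain-HTTP probe against port 8220 would be expected to fail, and k8s then restarts the container. The /api/status path and the timing values below are assumptions; adjust them to your deployment.

```yaml
# Sketch only: probe the Fleet Server over HTTPS instead of HTTP.
livenessProbe:
  httpGet:
    path: /api/status   # assumed Fleet Server status endpoint
    port: 8220          # same value as FLEET_SERVER_PORT
    scheme: HTTPS       # the kubelet skips certificate verification for HTTPS probes
  initialDelaySeconds: 30   # placeholder timing values
  periodSeconds: 10
```

Because the kubelet does not verify the certificate on HTTPS probes, the self-signed certificate should not need any extra configuration on the probe side.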