Fleet server stuck upgrading 8.5.3 -> 8.6.0

Our Fleet Server runs on our Kibana server, with about 20 agents connected.
We upgraded our Elastic cluster from 8.5.3 to 8.6.0, and the upgrade popped up in the Fleet UI just fine. I selected the Fleet Server for the upgrade and the status changed to upgrading, and it's been sitting like that for four days now.
The agent is still running 8.5.3, and none of the logs show any reference to attempting to download the 8.6.0 code from the Elastic repository. I have rebooted the Kibana server, and restarted the agent on its own, and it still just sits there running 8.5.3 while the UI status shows updating.
Elastic-agent status gives this response:

 elastic-agent status
Status: HEALTHY
Message: (no message)
Applications:
  * fleet-server           (HEALTHY)
                           Running on default policy with Fleet Server integration
  * filebeat_monitoring    (HEALTHY)
                           Running
  * metricbeat_monitoring  (HEALTHY)
                           Running

Anyone got any ideas?

Ross

Hello,

It's hard to give a definitive answer without investigating the Elastic Agent and Fleet Server logs.

If the agent didn't upgrade, I'd expect to see an error or warning message in the logs.

What you can do is check the agent document in ES to see whether it has an upgrade_started_at field while upgraded_at is either missing or null. That combination would suggest the upgrade was started but never completed.

You can query the agent with:
GET .fleet-agents/_doc/AGENT_ID
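
If it helps, here is a minimal Python sketch of that check. It takes the JSON body returned by the GET above (already parsed into a dict) and reports whether the upgrade looks stuck. The function name and the sample documents are made up for illustration. Only the field names upgrade_started_at and upgraded_at come from this thread, and I'm assuming they sit under _source in the response.

```python
def upgrade_looks_stuck(agent_doc):
    """Return True if the document shows an upgrade that started but never finished."""
    # Assumption: the fields live under "_source" in the GET response.
    source = agent_doc.get("_source", {})
    started = source.get("upgrade_started_at")
    finished = source.get("upgraded_at")
    # Stuck means: a start timestamp is present, and the finish one is missing or null.
    return started is not None and finished is None


# Illustrative documents only; the timestamps are placeholders, not real data.
stuck_doc = {"_source": {"upgrade_started_at": "2023-01-15T00:00:00Z", "upgraded_at": None}}
done_doc = {"_source": {"upgrade_started_at": None, "upgraded_at": "2023-01-15T00:05:00Z"}}

print(upgrade_looks_stuck(stuck_doc))
print(upgrade_looks_stuck(done_doc))
```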

You can also re-trigger the upgrade using the Kibana Fleet API:

curl --request POST \
  --url https://<KIBANA_HOST>/api/fleet/agents/<AGENT_ID>/upgrade \
  --user "<SUPERUSER_NAME>:<SUPERUSER_PASSWORD>" \
  --header 'Content-Type: application/json' \
  --header 'kbn-xsrf: as' \
  --data '{"version": "<VERSION>","force": true}'
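
For anyone who would rather script this than use curl, the same request can be sketched in Python with only the standard library. This is just a translation of the curl command above, not an official client. The helper name build_upgrade_request and the host, agent ID and credentials in the example call are placeholders I made up. The function only builds the request, so you can inspect it before sending.

```python
import base64
import json
import urllib.request


def build_upgrade_request(kibana_host, agent_id, user, password, version, force=True):
    """Build (but do not send) the POST request for the agent upgrade endpoint."""
    url = f"https://{kibana_host}/api/fleet/agents/{agent_id}/upgrade"
    body = json.dumps({"version": version, "force": force}).encode("utf-8")
    # Equivalent of curl's --user: HTTP basic authentication.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "kbn-xsrf": "as",
            "Authorization": f"Basic {token}",
        },
    )


# Placeholder values; substitute your own before sending.
req = build_upgrade_request("kibana.example.com:5601", "AGENT_ID", "elastic", "changeme", "8.6.0")
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req)
```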

Thanks for that. I re-triggered the update, and it seemed to work in that the fleet-server is now running 8.6.0, but it is unhealthy, and the logs are full of:

{"log.level":"error","@timestamp":"2023-01-19T03:50:21.594Z","message":"Error fetching data for metricset beat.state: error making http request: Get \"http://unix/state\": dial unix /opt/Elastic/Agent/data/tmp/fleet-server-default.sock: connect: no such file or directory","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"beat/metrics-monitoring","type":"beat/metrics"},"log.origin":{"file.line":256,"file.name":"module/wrapper.go"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}

Looking in the data/tmp directory, I can see that the socket does not exist, so it seems fleet-server is not starting properly, as seen here:

./elastic-agent status
State: DEGRADED
Message: 1 or more components/units in a failed state
Components:
  * fleet-server  (HEALTHY)
                  Healthy: communicating with pid '1700'
  * http/metrics  (HEALTHY)
                  Healthy: communicating with pid '1710'
  * filestream    (HEALTHY)
                  Healthy: communicating with pid '1719'
  * beat/metrics  (HEALTHY)
                  Healthy: communicating with pid '1729'

Unfortunately, the logs don't indicate any issues when starting up, or why the socket is not created.
Also, nothing is listening on port 8220, so none of the agents can check in.

Hmm, just saw this in the logs:

{"log.level":"error","@timestamp":"2023-01-19T04:23:17.502Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":833},"message":"Unit state changed fleet-server-default (STARTING->FAILED): invalid log level; must be one of: trace, debug, info, warning, error accessing 'fleet.agent.logging'","component":{"id":"fleet-server-default","state":"HEALTHY"},"unit":{"id":"fleet-server-default","type":"output","state":"FAILED","old_state":"STARTING"},"ecs.version":"1.6.0"}

Any idea where this is set, and how I can fix it? Because it's the Fleet Server itself, I can't use the Fleet/Kibana UI to do anything.
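
Since that line was easy to miss among the metricbeat errors, here is a rough Python sketch for pulling unit failures out of the agent's JSON logs. It assumes one JSON object per line with the same "unit" and "message" keys as the entry above. The function name is made up, and the sample entry is a trimmed copy of that log line.

```python
import json


def failed_unit_changes(log_lines):
    """Yield (unit_id, old_state, message) for each log entry whose unit state is FAILED."""
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON log line
        if not isinstance(entry, dict):
            continue
        unit = entry.get("unit", {})
        if unit.get("state") == "FAILED":
            yield unit.get("id"), unit.get("old_state"), entry.get("message")


# Trimmed copy of the log entry from this thread, rebuilt with json.dumps.
sample = json.dumps({
    "log.level": "error",
    "message": "Unit state changed fleet-server-default (STARTING->FAILED): "
               "invalid log level; must be one of: trace, debug, info, warning, error "
               "accessing 'fleet.agent.logging'",
    "unit": {"id": "fleet-server-default", "type": "output",
             "state": "FAILED", "old_state": "STARTING"},
})

for unit_id, old_state, message in failed_unit_changes([sample]):
    print(unit_id, old_state, message)
```

The function accepts any iterable of lines, so an open log file can be passed in directly.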

Please use the command ./elastic-agent status --output=yaml, which will provide more detail on the status of each unit. It will show which unit is in a failed state.

So here is that output, and it again points to an invalid logging setting somewhere. Is there some way, from the command line or by editing a file on the local server, that this can be overridden? I have grepped the Agent directories and can't find where this is configured in a YAML file anywhere.

 elastic-agent status --output=yaml
info:
  id: a8470c55-c2d8-44e3-b922-ad371f18fcfe
  version: 8.6.0
  commit: b79a5db77b5d6ffab9855234f8371d9e53978a24
  build_time: 2023-01-04 22:53:22 +0000 UTC
  snapshot: false
state: 3
message: 1 or more components/units in a failed state
components:
- id: fleet-server-default
  name: fleet-server
  state: 2
  message: 'Healthy: communicating with pid ''2048'''
  units:
  - unit_id: fleet-server-default-fleet-server-fleet_server-d4d1e18a-2ff0-41ae-b9ee-7005baa4d068
    unit_type: 0
    state: 0
    message: waiting for output unit
  - unit_id: fleet-server-default
    unit_type: 1
    state: 4
    message: 'invalid log level; must be one of: trace, debug, info, warning, error
      accessing ''fleet.agent.logging'''
  version_info:
    name: fleet-server
    version: 8.6.0
    meta:
      build_time: 2023-01-04 19:26:24 +0000 UTC
      commit: 05088c13
- id: http/metrics-monitoring
  name: http/metrics
  state: 2
  message: 'Healthy: communicating with pid ''2060'''
  units:
  - unit_id: http/metrics-monitoring
    unit_type: 1
    state: 2
    message: Healthy
  - unit_id: http/metrics-monitoring-metrics-monitoring-agent
    unit_type: 0
    state: 2
    message: Healthy
  version_info:
    name: beat-v2-client
    version: 8.6.0
    meta:
      build_time: 2023-01-04 01:30:07 +0000 UTC
      commit: 561a3e1839f1a50ce832e8e114de399b2bee2542
- id: filestream-monitoring
  name: filestream
  state: 2
  message: 'Healthy: communicating with pid ''2070'''
  units:
  - unit_id: filestream-monitoring
    unit_type: 1
    state: 2
    message: Healthy
  - unit_id: filestream-monitoring-filestream-monitoring-agent
    unit_type: 0
    state: 2
    message: Healthy
  version_info:
    name: beat-v2-client
    version: 8.6.0
    meta:
      build_time: 2023-01-04 01:28:13 +0000 UTC
      commit: 561a3e1839f1a50ce832e8e114de399b2bee2542
- id: beat/metrics-monitoring
  name: beat/metrics
  state: 2
  message: 'Healthy: communicating with pid ''2081'''
  units:
  - unit_id: beat/metrics-monitoring-metrics-monitoring-beats
    unit_type: 0
    state: 2
    message: Healthy
  - unit_id: beat/metrics-monitoring
    unit_type: 1
    state: 2
    message: Healthy
  version_info:
    name: beat-v2-client
    version: 8.6.0
    meta:
      build_time: 2023-01-04 01:30:07 +0000 UTC
      commit: 561a3e1839f1a50ce832e8e114de399b2bee2542
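
The numeric codes in that output are hard to read at a glance, so here is a small Python sketch that lists the units that are not healthy. The code-to-name mapping is an assumption: 0, 2, 3 and 4 are inferred from the outputs in this thread (unit_type 1 is the unit the log calls "output", state 3 is the DEGRADED agent, state 4 is the FAILED unit), while 1, 5 and 6 are my recollection of the agent's state list and may be wrong. The helper name is made up, and the status is passed in as an already-parsed dict so the example needs no YAML library.

```python
# Assumed mapping of numeric codes to names (see the note above).
STATE_NAMES = {0: "STARTING", 1: "CONFIGURING", 2: "HEALTHY", 3: "DEGRADED",
               4: "FAILED", 5: "STOPPING", 6: "STOPPED"}
UNIT_TYPES = {0: "input", 1: "output"}
HEALTHY = 2


def problem_units(status):
    """Return (component_id, unit_id, unit_type, state, message) for each unit not HEALTHY."""
    problems = []
    for component in status.get("components", []):
        for unit in component.get("units", []):
            state = unit.get("state")
            if state != HEALTHY:
                problems.append((
                    component.get("id"),
                    unit.get("unit_id"),
                    UNIT_TYPES.get(unit.get("unit_type"), "unknown"),
                    STATE_NAMES.get(state, "unknown"),
                    unit.get("message"),
                ))
    return problems


# The fleet-server component from the output above, as a parsed dict.
status = {"components": [{
    "id": "fleet-server-default",
    "units": [
        {"unit_id": "fleet-server-default-fleet-server-fleet_server-d4d1e18a-2ff0-41ae-b9ee-7005baa4d068",
         "unit_type": 0, "state": 0, "message": "waiting for output unit"},
        {"unit_id": "fleet-server-default", "unit_type": 1, "state": 4,
         "message": "invalid log level; must be one of: trace, debug, info, warning, error "
                    "accessing 'fleet.agent.logging'"},
    ],
}]}

for row in problem_units(status):
    print(row)
```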