Our Fleet Server runs on our Kibana server, with about 20 agents connected.
We upgraded our Elastic cluster from 8.5.3 to 8.6.0, and the upgrade showed up in the Fleet UI just fine. I selected the Fleet Server agent for the upgrade, its status changed to "upgrading", and it has been sitting like that for four days now.
The agent is still running 8.5.3, and none of the logs show any reference to attempting to download the 8.6.0 package from the Elastic repository. I have rebooted the Kibana server and restarted the agent on its own, and it still just sits there running 8.5.3, with the UI status still showing "updating".
Elastic-agent status gives this response:
elastic-agent status
Status: HEALTHY
Message: (no message)
Applications:
* fleet-server (HEALTHY)
Running on default policy with Fleet Server integration
* filebeat_monitoring (HEALTHY)
Running
* metricbeat_monitoring (HEALTHY)
Running
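(For anyone else hitting this: a stuck upgrade can usually be re-triggered without the Fleet UI by issuing a forced upgrade request against the Kibana Fleet API. This is only a sketch; the host, credentials, and agent ID are placeholders, so confirm the endpoint against the Fleet API docs for your version.)

# Force a retry of the upgrade for a single agent (placeholders throughout)
curl -s -X POST \
  -u elastic:changeme \
  -H 'kbn-xsrf: true' \
  -H 'Content-Type: application/json' \
  'https://kibana.example.com:5601/api/fleet/agents/<agent-id>/upgrade' \
  -d '{"version":"8.6.0","force":true}'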
Thanks for that. I re-triggered the upgrade, and it partly worked: the fleet-server is now running 8.6.0, but it is unhealthy and the logs are full of:
{"log.level":"error","@timestamp":"2023-01-19T03:50:21.594Z","message":"Error fetching data for metricset beat.state: error making http request: Get \"http://unix/state\": dial unix /opt/Elastic/Agent/data/tmp/fleet-server-default.sock: connect: no such file or directory","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"beat/metrics-monitoring","type":"beat/metrics"},"log.origin":{"file.line":256,"file.name":"module/wrapper.go"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
Looking in the data/tmp directory, I can see that the socket does not exist, so it seems the fleet-server is not starting properly, as seen here:
./elastic-agent status
State: DEGRADED
Message: 1 or more components/units in a failed state
Components:
* fleet-server (HEALTHY)
Healthy: communicating with pid '1700'
* http/metrics (HEALTHY)
Healthy: communicating with pid '1710'
* filestream (HEALTHY)
Healthy: communicating with pid '1719'
* beat/metrics (HEALTHY)
Healthy: communicating with pid '1729'
Unfortunately, the logs don't indicate any issues at startup, or explain why the socket is not created.
Also, nothing is listening on port 8220, so none of the agents can check in.
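(The port check is just something like the following, run on the Fleet Server host; the /api/status path is my assumption based on the Fleet Server docs:)

# Confirm nothing is bound to the Fleet Server port
ss -ltnp | grep 8220
# If Fleet Server were up, its status endpoint should answer here
curl -sk https://localhost:8220/api/status

The agent logs do, however, contain this error, which looks like the real problem: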
{"log.level":"error","@timestamp":"2023-01-19T04:23:17.502Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":833},"message":"Unit state changed fleet-server-default (STARTING->FAILED): invalid log level; must be one of: trace, debug, info, warning, error accessing 'fleet.agent.logging'","component":{"id":"fleet-server-default","state":"HEALTHY"},"unit":{"id":"fleet-server-default","type":"output","state":"FAILED","old_state":"STARTING"},"ecs.version":"1.6.0"}
Any idea where this is set, and how I can fix it? Because it's the Fleet Server, I can't use the Fleet/Kibana UI to do anything.
Please use the command ./elastic-agent status --output=yaml; that will provide more detail on the status of each unit and show which unit is in a failed state.
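For example (the path assumes a Linux install under /opt/Elastic/Agent, as in the socket path above):

sudo /opt/Elastic/Agent/elastic-agent status --output=yaml

In that YAML, look for the unit reported as FAILED and read its accompanying message.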
So here is that output, and it again points to an invalid logging setting somewhere. Is there some way, from the command line or by editing a file on the local server, that this can be overridden? I have grepped the Agent directories and can't find where this is configured in any YAML file.
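(Two read-only checks that might show where that value comes from; as far as I know the enrolled policy is stored encrypted rather than as plain YAML in 8.x, which would explain why grep finds nothing. The host, credentials, and policy ID are placeholders, and the download endpoint is taken from the Fleet API docs, so verify it for your version.)

# Print the configuration the running agent has actually resolved,
# including the logging level it thinks it was given
sudo /opt/Elastic/Agent/elastic-agent inspect

# Fetch the agent policy exactly as Fleet/Kibana renders it
curl -s -u elastic:changeme \
  -H 'kbn-xsrf: true' \
  'https://kibana.example.com:5601/api/fleet/agent_policies/<policy-id>/download'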