Okay yep
{"log.level":"warn","@timestamp":"2024-11-18T05:17:53.792Z","log.logger":"transport","log.origin":{"function":"github.com/elastic/elastic-agent-libs/transport/httpcommon.(*HTTPTransportSettings).RoundTripper.NetDialer.TestNetDialer.func3","file.name":"transport/tcp.go","file.line":53},"message":"DNS lookup failure \"fleet-server-agent-http.elastic-system.svc\": lookup fleet-server-agent-http.elastic-system.svc on 10.96.0.10:53: no such host","ecs.version":"1.6.0"}
There's the fleet enrollment error you were pointing to before.
My best guess at this point is that the Elastic Agent was previously deployed with this as the Fleet enrollment URL, it didn't work, and you then redeployed it with a correct URL?
Once Agent starts, it translates the env vars into its local agent configuration, and after that the env vars no longer affect the Agent configuration.
As a result, you may need to wipe the agent configuration from the volume (in this case):
        - name: elastic-agent-state
          mountPath: /usr/share/elastic-agent/state
    ...
    - name: elastic-agent-state
      hostPath:
        path: /var/lib/elastic-agent-managed/kube-system/state
        type: DirectoryOrCreate
Try removing the /var/lib/elastic-agent-managed/kube-system/state folder from the node (it's mounted from the host), or exec into the pod and remove the /usr/share/elastic-agent/state folder. Then restart the agent.
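As a rough sketch of the exec-into-the-pod route, something like the script below might work. The namespace (guessed from the hostPath) and the pod name are placeholders, not values confirmed in this thread. It also assumes `find` is available in the agent image, and that the pod is managed by a controller such as a DaemonSet that recreates it after deletion. Because the state directory is a mount point inside the pod, the sketch empties it rather than removing the directory itself. It only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch only: NAMESPACE and POD are placeholders, not values from this thread.
NAMESPACE="${NAMESPACE:-kube-system}"       # assumption, guessed from the hostPath
POD="${POD:-elastic-agent-xxxxx}"           # hypothetical pod name
STATE_DIR="/usr/share/elastic-agent/state"  # mountPath from the manifest above
DRY_RUN="${DRY_RUN:-1}"                     # 1 = only print the commands

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

reset_agent_state() {
  # Empty the state directory from inside the pod. The directory is a
  # mount point, so only its contents are deleted (assumes find exists).
  run kubectl -n "$NAMESPACE" exec "$POD" -- find "$STATE_DIR" -mindepth 1 -delete
  # Delete the pod so that its controller (assumed) recreates it and the
  # agent enrolls again from the current env vars.
  run kubectl -n "$NAMESPACE" delete pod "$POD"
}

reset_agent_state
```

Run it once as-is to review the two kubectl commands, then again with DRY_RUN=0 and POD set to the real pod name if they look right for your cluster.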
There's an open issue here that tracks a request for allowing env var changes to cause agent re-enrollment.